machine learning in production with scikit-learn

Machine Learning in Production with scikit-learn

Jeff Klukas - Data Engineer at Simple

Overview

• What’s the problem we’re solving?

• Why machine learning?

• Walkthrough of developing the model

• ✨ Live demo ✨

• Complications of moving this workflow to production

• Other potential approaches

Categorizing chats

# SELECT subject, body, category FROM chats;

subject       | body                      | category
--------------+---------------------------+----------------
Check deposit | Hi how are you? I was…    | education
Lost Card     | Can you send me a new…    | urgent
my transfer   | My transfer of $10 isn’t… | education
Mail deposits | I have a large check…     | education

Categories: urgent, customer education, new product, incidents, other

Machine Learning

sklearn.pipeline

from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import (
    CountVectorizer, TfidfTransformer)
from xgboost import XGBClassifier

stopwords, lemmatizer = …

# MessagePreprocessor and TextProcessor are custom transformers,
# not part of scikit-learn
pipeline = Pipeline([
    ('preprocess', MessagePreprocessor(subject_weight=2)),
    ('text', TextProcessor(stopwords, lemmatizer)),
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf', XGBClassifier(objective='multi:softmax')),
])
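The pipeline above relies on two custom transformers that the slides do not show. As a rough idea of what the first one might look like, here is a minimal sketch of a `MessagePreprocessor`. How `subject_weight` is really applied is an assumption: in this version the subject is simply repeated so that its words count more in the later vectorizer step.

```python
from sklearn.base import BaseEstimator, TransformerMixin


class MessagePreprocessor(BaseEstimator, TransformerMixin):
    """Hypothetical sketch: merge subject and body into one text per message."""

    def __init__(self, subject_weight=1):
        self.subject_weight = subject_weight

    def fit(self, X, y=None):
        # Nothing to learn; present so the class works as a pipeline step
        return self

    def transform(self, X):
        # Accept a pandas DataFrame or a plain list of dicts
        records = X.to_dict("records") if hasattr(X, "to_dict") else X
        return [
            " ".join([r["subject"]] * self.subject_weight + [r["body"]])
            for r in records
        ]


example = MessagePreprocessor(subject_weight=2).transform(
    [{"subject": "Lost Card", "body": "Can you send me a new one?"}]
)
print(example)
```

Deriving from `BaseEstimator` exposes `subject_weight` through `get_params`/`set_params`, which is what allows a parameter search to address it as `preprocess__subject_weight`.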

Training the model

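import pandas as pd
from sklearn.model_selection import train_test_split

# pandas expects the SQL string first and the connection second
data_frame = pd.read_sql(
    "SELECT category, subject, body FROM chats;", redshift_connection)

X = data_frame[['subject', 'body']]
y = data_frame['category']

# Hold out a third of the rows for testing
X_train, X_test, y_train, y_test = \
    train_test_split(X, y, test_size=0.33, random_state=0)

pipeline.fit(X_train, y_train)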

Overfitting

https://en.wikipedia.org/wiki/Overfitting
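One common way to spot overfitting is to compare accuracy on the training rows with accuracy on the held-out rows. The sketch below is not from the talk: it uses synthetic data and a decision tree (both assumptions chosen to keep the example self-contained) to show a model that memorizes its training set.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic, noisy data: flip_y assigns a random label to about 20% of samples
X, y = make_classification(n_samples=300, n_features=20, n_informative=5,
                           flip_y=0.2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=0)

# An unconstrained tree can memorize the training set, noise included
tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

train_accuracy = tree.score(X_train, y_train)
test_accuracy = tree.score(X_test, y_test)
print(f"train accuracy: {train_accuracy:.2f}")
print(f"test accuracy:  {test_accuracy:.2f}")
```

A large gap between the two numbers suggests the model has learned the noise in the training data rather than a pattern that generalizes.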

Testing the model

from sklearn.metrics import classification_report

y_predicted = pipeline.predict(X_test)

print(classification_report(y_test, y_predicted))

             precision    recall  f1-score   support

    class 0       0.67      1.00      0.80         2
    class 1       0.00      0.00      0.00         1
    class 2       1.00      0.50      0.67         2

avg / total       0.67      0.60      0.59         5
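To make the columns of the report concrete, the sketch below computes the same per-class metrics directly. The five toy labels are an assumption, picked because they reproduce the per-class figures in the report above; they are not data from the chat classifier.

```python
from sklearn.metrics import precision_recall_fscore_support

# Toy labels (assumed for illustration)
y_true = [0, 1, 2, 2, 0]
y_pred = [0, 0, 2, 1, 0]

# One array entry per class, in sorted label order
precision, recall, f1, support = precision_recall_fscore_support(y_true, y_pred)

for label, p, r, f, s in zip(sorted(set(y_true)), precision, recall, f1, support):
    print(f"class {label}: precision={p:.2f} recall={r:.2f} "
          f"f1={f:.2f} support={s}")
```

Precision is the share of predictions for a class that were right (class 0 was predicted three times and was correct twice), recall is the share of true members of a class that were found, and support is the number of true members.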

Serving the model in Flask

from flask import Flask, jsonify, request

app = Flask(__name__)

@app.route('/chat-classification-api/messages', methods=['POST'])
def classify_messages():
    """Classify given chat messages"""
    messages = request.get_json()

    y = pipeline.predict(messages)

    # join class labels back with identifiers
    predictions = [{"chat_id": message["chat_id"], "class_label": label}
                   for message, label in zip(messages, y)]

    return jsonify(predictions)
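To show the request and response shapes this endpoint implies, here is a small self-contained sketch that exercises the same route with Flask's built-in test client. `StubPipeline` is a hypothetical stand-in for the trained model, and the message fields other than `chat_id` are assumptions based on the columns used earlier.

```python
from flask import Flask, jsonify, request


class StubPipeline:
    """Hypothetical stand-in for the trained pipeline."""

    def predict(self, messages):
        return ["urgent" if "card" in m["subject"].lower() else "education"
                for m in messages]


pipeline = StubPipeline()
app = Flask(__name__)


@app.route('/chat-classification-api/messages', methods=['POST'])
def classify_messages():
    messages = request.get_json()
    y = pipeline.predict(messages)
    predictions = [{"chat_id": m["chat_id"], "class_label": label}
                   for m, label in zip(messages, y)]
    return jsonify(predictions)


# Exercise the endpoint in-process, without starting a server
client = app.test_client()
response = client.post(
    '/chat-classification-api/messages',
    json=[
        {"chat_id": 1, "subject": "Lost Card", "body": "Can you send me a new one?"},
        {"chat_id": 2, "subject": "Check deposit", "body": "Hi how are you?"},
    ],
)
result = response.get_json()
print(result)
```

The response is a JSON list with one object per submitted message, so callers can match predictions to chats by `chat_id`.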

Live Demo

How do we take this to production?

Step 1: Separate training and serving

Model Persistence

import pickle
import boto3

def write_to_s3(pipeline, key, bucket):
    s3_client = boto3.client("s3")
    kms_client = boto3.client("kms")

    # serialize, then encrypt before upload
    # (my_encrypt_function is defined elsewhere)
    pkl = pickle.dumps(pipeline)
    enc_pkl = my_encrypt_function(pkl, kms_client)

    s3_client.put_object(Bucket=bucket, Key=key, Body=enc_pkl,
                         ServerSideEncryption="AES256")

Model Persistence

import pickle
import boto3
from flask import current_app

def load_message_classifier(app):
    conf = app.config["MESSAGE_CLASSIFIER"]

    s3_client = boto3.client("s3")
    kms_client = boto3.client("kms")

    resp = s3_client.get_object(Bucket=conf["bucket"], Key=conf["path"])
    untrusted_bytes = resp["Body"].read()
    # decrypt before unpickling (decrypt is defined elsewhere)
    pkl = decrypt(untrusted_bytes, kms_client)

    with app.app_context():
        current_app._message_classifier = pickle.loads(pkl)
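`pickle.loads` can execute code embedded in the bytes it is given, which is presumably why the loader names the download `untrusted_bytes` and decrypts it first. The encryption helpers are not shown in the slides. As a locally runnable stand-in (a different technique from KMS encryption, and only a sketch), the following signs the pickle with an HMAC and refuses to unpickle bytes whose signature does not match. The key and the function names are hypothetical.

```python
import hashlib
import hmac
import pickle

DIGEST_SIZE = hashlib.sha256().digest_size  # 32 bytes


def dumps_signed(obj, secret_key):
    """Pickle obj and prepend an HMAC-SHA256 signature of the payload."""
    payload = pickle.dumps(obj)
    signature = hmac.new(secret_key, payload, hashlib.sha256).digest()
    return signature + payload


def loads_verified(blob, secret_key):
    """Unpickle only if the signature matches the payload."""
    signature, payload = blob[:DIGEST_SIZE], blob[DIGEST_SIZE:]
    expected = hmac.new(secret_key, payload, hashlib.sha256).digest()
    if not hmac.compare_digest(signature, expected):
        raise ValueError("signature mismatch; refusing to unpickle")
    return pickle.loads(payload)


secret_key = b"hypothetical-key-loaded-from-a-secret-store"
model = {"labels": ["urgent", "education"]}  # stand-in for a fitted pipeline

blob = dumps_signed(model, secret_key)
restored = loads_verified(blob, secret_key)
print(restored == model)
```

The check happens before `pickle.loads` is called, so tampered bytes are rejected without ever being deserialized.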

Step 2: Provide an environment for batch training and evaluation

Optimizing Parameter Values

from sklearn.model_selection import GridSearchCV

params = {
    'preprocess__subject_weight': (1, 2, 3, 4, 5),
    'text__stopwords': ([], IGNORE, PUNCTUATION),
    'vect__max_df': (0.5, 0.75, 1.0),
    'vect__ngram_range': ((1, 1), (1, 2)),
    'tfidf__use_idf': (True, False),
    'tfidf__norm': ('l1', 'l2'),
}

search = GridSearchCV(pipeline, params)
search.fit(X_train, y_train)
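The search above cannot run without the custom transformers and the chat data, so here is a reduced, self-contained sketch of the same idea. The messages are made up for illustration, `LogisticRegression` is swapped in for `XGBClassifier` to avoid the extra dependency, and the names `toy_pipeline` and `toy_params` are assumptions.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Made-up messages, purely for illustration
texts = [
    "lost my card please send a new one",
    "my card was stolen yesterday",
    "card missing need replacement now",
    "someone took my card freeze it",
    "how do I deposit a check",
    "when will my transfer arrive",
    "how long does a mail deposit take",
    "where can I see my transfer history",
]
labels = ["urgent"] * 4 + ["education"] * 4

toy_pipeline = Pipeline([
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf', LogisticRegression()),
])

# Keys follow the <step name>__<parameter name> convention
toy_params = {
    'vect__ngram_range': [(1, 1), (1, 2)],
    'tfidf__use_idf': [True, False],
}

# cv=2 because the toy data set only has four examples per class
search = GridSearchCV(toy_pipeline, toy_params, cv=2)
search.fit(texts, labels)

print(search.best_params_)
print(search.best_score_)
```

Every combination is fitted once per cross-validation fold, so cost grows quickly: the grid on the slide has 5 × 3 × 3 × 2 × 2 × 2 = 360 combinations, which is one reason to run the search in a batch environment.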

Step 3: Monitor performance, adapt to production load, degrade gracefully
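The slides do not say how the service degrades, so the following is only one possible reading: time each prediction call for monitoring, and if the model fails, log the error and return a default label instead of failing the request. The function name, the logger name, and the choice of "other" as the fallback label are all assumptions.

```python
import logging
import time

logger = logging.getLogger("chat_classifier")

FALLBACK_LABEL = "other"  # hypothetical default when the model is unavailable


def predict_with_fallback(pipeline, messages):
    """Return one label per message, falling back to a default on failure."""
    started = time.perf_counter()
    try:
        labels = list(pipeline.predict(messages))
    except Exception:
        logger.exception("prediction failed; returning fallback labels")
        labels = [FALLBACK_LABEL] * len(messages)
    elapsed_ms = (time.perf_counter() - started) * 1000
    logger.info("classified %d messages in %.1f ms", len(messages), elapsed_ms)
    return labels


class BrokenPipeline:
    """Stand-in that always fails, to exercise the fallback path."""

    def predict(self, messages):
        raise RuntimeError("model not loaded")


result = predict_with_fallback(BrokenPipeline(), [{"chat_id": 1}, {"chat_id": 2}])
print(result)
```

Whether a default label is acceptable depends on what consumes the predictions; routing unclassified chats to a human queue would be another reasonable fallback.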

Other Approaches

Considerations

• How big is your team?

• How large of a problem space do you need to cover?

• What is your existing stack?

Off-the-Shelf

• Train and test in a batch environment

• Output serialized model and classification report

• sklearn.pipeline is convenient for storing code+params

• Serve on-demand predictions separately

• Treat this like any production service

Thank You

Questions ✨💖
