Machine Learning in Production with scikit-learn

Post on 21-Jan-2018


Machine Learning in Production with scikit-learn

Jeff Klukas - Data Engineer at Simple


Overview

• What’s the problem we’re solving?

• Why machine learning?

• Walkthrough of developing the model

• ✨ Live demo ✨

• Complications of moving this workflow to production

• Other potential approaches


Categorizing chats

# SELECT subject, body, category FROM chats;

 subject       | body                      | category
---------------+---------------------------+-----------
 Check deposit | Hi how are you? I was…    | education
 Lost Card     | Can you send me a new…    | urgent
 my transfer   | My transfer of $10 isn’t… | education
 Mail deposits | I have a large check…     | education

urgent, customer education, new product, incidents, other

Machine Learning

sklearn.pipeline

from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import (
    CountVectorizer, TfidfTransformer)
from xgboost import XGBClassifier

stopwords, lemmatizer = …

pipeline = Pipeline([
    ('preprocess', MessagePreprocessor(subject_weight=2)),
    ('text', TextProcessor(stopwords, lemmatizer)),
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf', XGBClassifier(objective='multi:softmax')),
])
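The slides do not show how MessagePreprocessor uses subject_weight. One plausible reading, sketched below purely as an assumption, is that the subject is repeated so its words count more once the text reaches a bag-of-words vectorizer. The helper name combine_message and the sample message are hypothetical and do not come from the talk.

```python
def combine_message(subject, body, subject_weight=1):
    """Return one text string with the subject repeated subject_weight times.

    Hypothetical helper, not the talk's MessagePreprocessor: repeating the
    subject raises the term counts of its words in a later CountVectorizer.
    """
    if subject_weight < 0:
        raise ValueError("subject_weight must be non-negative")
    parts = [subject] * subject_weight + [body]
    # Skip empty pieces so a missing subject or body adds no stray spaces.
    return " ".join(part for part in parts if part)


# Made-up message for illustration only.
text = combine_message("Lost Card", "please send a replacement", subject_weight=2)
print(text)
```

In a real pipeline this logic would live inside a transformer object with fit and transform methods, so that the weight can be tuned like any other pipeline parameter.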


Training the model

import pandas as pd
from sklearn.model_selection import train_test_split

# read_sql takes the query first, then the connection
data_frame = pd.read_sql(
    "SELECT category, subject, body FROM chats;", redshift_connection)

X = data_frame[['subject', 'body']]
y = data_frame['category']

X_train, X_test, y_train, y_test = \
    train_test_split(X, y, test_size=0.33, random_state=0)

pipeline.fit(X_train, y_train)


Overfitting

https://en.wikipedia.org/wiki/Overfitting


Testing the model

from sklearn.metrics import classification_report

y_predicted = pipeline.predict(X_test)

print(classification_report(y_test, y_predicted))

             precision    recall  f1-score   support

    class 0       0.67      1.00      0.80         2
    class 1       0.00      0.00      0.00         1
    class 2       1.00      0.50      0.67         2

avg / total       0.67      0.60      0.59         5
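To make the report columns concrete, here is a small standard-library sketch that computes precision, recall, and F1 per class. The label lists are hypothetical, chosen only so the results agree with the report above; they are not the talk's data, and the function is an illustration rather than scikit-learn's implementation.

```python
def per_class_metrics(y_true, y_pred):
    """Return {label: (precision, recall, f1, support)} for each class."""
    report = {}
    for label in sorted(set(y_true) | set(y_pred)):
        # True positives: predicted this label and it was correct.
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        predicted = sum(1 for p in y_pred if p == label)
        support = sum(1 for t in y_true if t == label)
        precision = tp / predicted if predicted else 0.0
        recall = tp / support if support else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        report[label] = (precision, recall, f1, support)
    return report


# Hypothetical labels consistent with the report: five test examples.
y_true = [0, 0, 1, 2, 2]
y_pred = [0, 0, 0, 2, 1]

report = per_class_metrics(y_true, y_pred)
for label, (p, r, f1, support) in report.items():
    print(f"class {label}  {p:.2f}  {r:.2f}  {f1:.2f}  {support}")
```

Precision answers "of the chats labeled with this class, how many were right?", while recall answers "of the chats that truly belong to this class, how many did we find?". The avg / total row is the average of each column weighted by support.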


Serving the model in Flask

from flask import Flask, jsonify, request

# flask has no top-level route; routes are registered on an app object
app = Flask(__name__)

@app.route('/chat-classification-api/messages', methods=['POST'])
def classify_messages():
    """Classify given chat messages"""
    messages = request.get_json()

    y = pipeline.predict(messages)

    # join class labels back with identifiers
    predictions = [{"chat_id": message["chat_id"], "class_label": label}
                   for message, label in zip(messages, y)]

    return jsonify(predictions)


Live Demo


How do we take this to production?


Step 1: Separate training and serving


Model Persistence

import pickle
import boto3

def write_to_s3(pipeline, key, bucket):
    s3_client = boto3.client("s3")
    kms_client = boto3.client("kms")

    pkl = pickle.dumps(pipeline)
    enc_pkl = my_encrypt_function(pkl, kms_client)

    # use the bucket argument passed to this function
    s3_client.put_object(Bucket=bucket, Key=key, Body=enc_pkl,
                         ServerSideEncryption="AES256")


Model Persistence

import pickle
import boto3
from flask import current_app

def load_message_classifier(app):
    conf = app.config["MESSAGE_CLASSIFIER"]

    s3_client = boto3.client("s3")
    kms_client = boto3.client("kms")

    resp = s3_client.get_object(Bucket=conf["bucket"], Key=conf["path"])
    untrusted_bytes = resp["Body"].read()
    pkl = decrypt(untrusted_bytes, kms_client)

    with app.app_context():
        current_app._message_classifier = pickle.loads(pkl)
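The loading code treats the downloaded bytes as untrusted until they are decrypted, which matters because pickle.loads can run arbitrary code from a crafted payload. The encryption helpers are not shown in the slides, so the sketch below swaps in a different, simpler check: an HMAC signature from the standard library stands in for the KMS step. The function names, the key, and the stand-in model object are all hypothetical.

```python
import hashlib
import hmac
import pickle

DIGEST_SIZE = hashlib.sha256().digest_size  # 32 bytes for SHA-256


def dump_signed(obj, secret_key):
    """Pickle obj and prepend an HMAC-SHA256 signature of the payload."""
    payload = pickle.dumps(obj)
    signature = hmac.new(secret_key, payload, hashlib.sha256).digest()
    return signature + payload


def load_signed(blob, secret_key):
    """Verify the signature first; unpickle only if it matches."""
    signature, payload = blob[:DIGEST_SIZE], blob[DIGEST_SIZE:]
    expected = hmac.new(secret_key, payload, hashlib.sha256).digest()
    if not hmac.compare_digest(signature, expected):
        raise ValueError("signature mismatch; refusing to unpickle")
    return pickle.loads(payload)


# Stand-in for a trained pipeline; a real key would come from a secret store.
key = b"demo-key-not-for-production"
model = {"labels": ["urgent", "education"], "version": 1}

blob = dump_signed(model, key)      # what the training job would upload
restored = load_signed(blob, key)   # what the serving process would load
print(restored == model)
```

Whichever check is used, the serving process also needs the same library versions and custom classes importable as the training job, since a pickle stores references to classes rather than their code.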

Step 2: Provide an environment for batch training and evaluation


Optimizing Parameter Values

from sklearn.model_selection import GridSearchCV

params = {
    'preprocess__subject_weight': (1, 2, 3, 4, 5),
    'text__stopwords': ([], IGNORE, PUNCTUATION),
    'vect__max_df': (0.5, 0.75, 1.0),
    'vect__ngram_range': ((1, 1), (1, 2)),
    'tfidf__use_idf': (True, False),
    'tfidf__norm': ('l1', 'l2'),
}

search = GridSearchCV(pipeline, params)
search.fit(X_train, y_train)
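An exhaustive grid search tries every combination of the listed values, so the cost grows multiplicatively. The sketch below counts the combinations in the grid above using only the standard library. The stopword entries are placeholder strings, since the IGNORE and PUNCTUATION lists are not shown in the slides, and the fold count of 5 is an assumption.

```python
from itertools import product

param_grid = {
    "preprocess__subject_weight": (1, 2, 3, 4, 5),
    # Placeholder names for the three stopword lists in the grid above.
    "text__stopwords": ("none", "ignore", "punctuation"),
    "vect__max_df": (0.5, 0.75, 1.0),
    "vect__ngram_range": ((1, 1), (1, 2)),
    "tfidf__use_idf": (True, False),
    "tfidf__norm": ("l1", "l2"),
}

# One candidate per element of the Cartesian product of all value lists.
names = list(param_grid)
candidates = [dict(zip(names, values))
              for values in product(*param_grid.values())]

n_folds = 5  # assumed number of cross-validation folds
total_fits = len(candidates) * n_folds

print(len(candidates), "candidates,", total_fits, "pipeline fits")
```

With 360 candidates, each fitted once per fold, the search runs well over a thousand pipeline fits before any final refit, which is the kind of workload that belongs in a batch environment rather than in the serving process.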

Step 3: Monitor performance, adapt to production load, degrade gracefully
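The talk does not show how the service degrades. As one possible sketch, a wrapper can catch prediction failures, log them for monitoring, and return a default label so callers still get a well-formed response. The function name, the logger name, and the choice of "other" as the fallback category are assumptions for illustration.

```python
import logging

logger = logging.getLogger("chat_classifier")

FALLBACK_LABEL = "other"  # assumed default, taken from the category list


def classify_with_fallback(predict, messages, fallback_label=FALLBACK_LABEL):
    """Return (labels, degraded). On any failure, every message gets the fallback."""
    try:
        labels = list(predict(messages))
        if len(labels) != len(messages):
            raise ValueError("prediction count does not match message count")
        return labels, False
    except Exception:
        # Log with traceback so the failure is visible to monitoring.
        logger.exception("classification failed; using fallback label")
        return [fallback_label] * len(messages), True


def broken_predict(messages):
    """Simulates a model that failed to load."""
    raise RuntimeError("model not loaded")


labels, degraded = classify_with_fallback(
    broken_predict, [{"chat_id": 1}, {"chat_id": 2}])
print(labels, degraded)
```

Returning the degraded flag alongside the labels lets the caller decide whether to show the prediction, and counting how often it is set gives a simple health metric.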


Other Approaches


Considerations

• How big is your team?

• How large of a problem space do you need to cover?

• What is your existing stack?


Off-the-Shelf


Recap

• Train and test in a batch environment

• Output serialized model and classification report

• sklearn.pipeline is convenient for storing code+params

• Serve on-demand predictions separately

• Treat this like any production service

Thank You


Questions ✨💖
