data product architectures

Data Product Architectures Benjamin Bengfort @bbengfort District Data Labs

Post on 24-Jan-2017

TRANSCRIPT

Page 1: Data Product Architectures

Data Product Architectures

Benjamin Bengfort @bbengfort

District Data Labs

Page 2: Data Product Architectures

Abstract

Page 3: Data Product Architectures

What is data science?

Or what is the goal of data science?

Or why do they pay us so much?

Page 4: Data Product Architectures

Two Objectives Orient Data Science to Users

Page 5: Data Product Architectures

Data Products are self-adapting, broadly applicable software-based engines that derive their value from data and generate more data by influencing human behavior or by making inferences or predictions upon new data.

Page 6: Data Product Architectures

Data Products are Applications that Employ Many Machine Learning Models

Page 7: Data Product Architectures

Data Report

Page 8: Data Product Architectures

Without Feedback, Models are Disconnected

They cannot adapt, tune, or react.
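
One way to picture the difference, as a minimal sketch only: `RunningMeanModel` below is a hypothetical toy predictor (it is not from the talk). The copy that receives observed outcomes shifts its predictions, while the copy that never sees feedback stays frozen at its initial estimate.

```python
class RunningMeanModel:
    """Hypothetical toy model: predicts the mean of the values it has seen."""

    def __init__(self, initial=0.0):
        self.total = initial
        self.count = 1

    def predict(self):
        return self.total / self.count

    def feedback(self, observed):
        # Feedback closes the loop: each observed outcome adjusts the model.
        self.total += observed
        self.count += 1


disconnected = RunningMeanModel(initial=10.0)
connected = RunningMeanModel(initial=10.0)

for observed in [20.0, 30.0]:
    connected.feedback(observed)  # only this model sees user outcomes

print(disconnected.predict())  # 10.0
print(connected.predict())     # 20.0
```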

Page 9: Data Product Architectures
Page 10: Data Product Architectures
Page 11: Data Product Architectures

Data Products aren’t single models

So how do we architect data products?

Page 12: Data Product Architectures

The Lambda Architecture
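
The slide only names the pattern. As it is commonly described, the Lambda Architecture pairs a batch layer that recomputes views from an immutable master dataset with a speed layer that covers events since the last batch run, and a serving layer that merges the two at query time. The sketch below illustrates that split under assumptions: the page-view events and the function names `batch_view`, `speed_view`, and `query` are invented for illustration.

```python
from collections import Counter

# Immutable, append-only master dataset of (timestamp, page) events.
master = [(1, "home"), (2, "home"), (3, "about")]


def batch_view(events):
    """Batch layer: recompute the full view from all stored events."""
    return Counter(page for _, page in events)


def speed_view(events, since):
    """Speed layer: count only events newer than the last batch run."""
    return Counter(page for ts, page in events if ts > since)


def query(page, batch, realtime):
    """Serving layer: merge the batch and real-time views at query time."""
    return batch[page] + realtime[page]


batch = batch_view(master)  # batch run covers events up to timestamp 3
last_batch_ts = 3

# New events arrive after the batch run has finished.
master += [(4, "home"), (5, "about")]
realtime = speed_view(master, since=last_batch_ts)

home_views = query("home", batch, realtime)
about_views = query("about", batch, realtime)
```

The batch view can be slow and thorough because the speed view fills the gap until the next batch run replaces it.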

Page 13: Data Product Architectures
Page 14: Data Product Architectures

Three Case Studies

Page 15: Data Product Architectures

Analyst Architecture

Page 16: Data Product Architectures

Analyst Architecture: Document Review

Page 17: Data Product Architectures

Analyst Architecture: Triggers

Page 18: Data Product Architectures

Recommender Architecture

Page 19: Data Product Architectures

Recommender: Annotation Service

Page 20: Data Product Architectures

Partisan Discourse Architecture

Page 21: Data Product Architectures

Partisan Discourse: Adding Documents

Page 22: Data Product Architectures

Partisan Discourse: Documents

Page 23: Data Product Architectures

Partisan Discourse: User-Specific Models

Page 24: Data Product Architectures

Commonalities?

Page 25: Data Product Architectures
Page 26: Data Product Architectures

Microservices Architecture: Smart Endpoints, Dumb Pipe

[Diagram: services communicating with one another over HTTP]

Stateful Services

Database Backed Services

Page 27: Data Product Architectures
Page 28: Data Product Architectures

Django Application Model

Page 29: Data Product Architectures

Class-Based, Definitional Programming

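from django.db import models
from rest_framework import serializers as rf
from rest_framework import viewsets


class Instance(models.Model):
    # Django expects choices as (stored value, display label) pairs.
    SHAPES = (
        ('square', 'square'),
        ('triangle', 'triangle'),
        ('circle', 'circle'),
    )

    # CharField requires max_length; 32 is an illustrative value.
    color = models.CharField(max_length=32, default='red')
    shape = models.CharField(max_length=32, choices=SHAPES)
    amount = models.IntegerField()


class InstanceSerializer(rf.ModelSerializer):
    # Read-only: returned to the client, never accepted as input.
    prediction = rf.CharField(read_only=True)

    class Meta:
        model = Instance
        # Fields declared on the serializer must also be listed here.
        fields = ('color', 'shape', 'amount', 'prediction')


class InstanceViewSet(viewsets.ModelViewSet):
    queryset = Instance.objects.all()
    serializer_class = InstanceSerializer

    # Stubs marking the actions a ModelViewSet exposes. Each handler must
    # return a Response; delete an override to keep the inherited behavior.
    def list(self, request):
        pass

    def create(self, request):
        pass

    def retrieve(self, request, pk=None):
        pass

    def update(self, request, pk=None):
        pass

    def destroy(self, request, pk=None):
        pass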

Page 30: Data Product Architectures

Features and Instances as Star Schema

Page 31: Data Product Architectures

REST API Feature Interaction

Page 32: Data Product Architectures

Model (ML) Build Process: Export Instance Table

COPY (
    SELECT instances.* FROM instances
    JOIN feature ON feature.id = instances.id
    ...
    ORDER BY instances.created
    LIMIT 10000
) TO '/tmp/instances.csv' DELIMITER ',' CSV HEADER;

Page 33: Data Product Architectures

Model (ML) Build Process: Build Model

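import pandas as pd
from sklearn.svm import SVC
from sklearn.model_selection import KFold

# Load Data
data = pd.read_csv('/tmp/instances.csv')

# Assumption: the target is in a column named 'label' (the slide does not
# name it) and the remaining columns are already numeric features.
X = data.drop(columns=['label'])
y = data['label']
scores = []

# Evaluation
folds = KFold(n_splits=12)
for train, test in folds.split(X):
    model = SVC()
    model.fit(X.iloc[train], y.iloc[train])
    score = model.score(X.iloc[test], y.iloc[test])
    scores.append(score)

# Build the actual model
model = SVC()
model.fit(X, y)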

Page 34: Data Product Architectures

Model (ML) Build Process: Store Model

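import pickle
import base64
import datetime


# The function name is illustrative; the slide shows only the body.
def serialize_model(model, scores):
    # Pickle to bytes, then base64-encode so the payload is text-safe.
    data = pickle.dumps(model)
    data = base64.b64encode(data).decode('ascii')

    return {
        "model": data,
        "created": datetime.datetime.now(),
        "form": repr(model),
        "name": model.__class__.__name__,
        "scores": scores,
    }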
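
The storage step is only useful if the payload can be turned back into a model. As a minimal sketch of that round trip using the standard library's `pickle.dumps` and `base64.b64encode`: the helper names `dumps_model` and `loads_model` are hypothetical, and a plain dict of parameters stands in for a fitted estimator.

```python
import base64
import pickle


def dumps_model(model):
    """Pickle a model and base64-encode it into an ASCII string."""
    return base64.b64encode(pickle.dumps(model)).decode("ascii")


def loads_model(payload):
    """Reverse dumps_model. Only unpickle payloads from a trusted source."""
    return pickle.loads(base64.b64decode(payload))


# Stand-in for a fitted estimator: any picklable object works the same way.
fitted = {"weights": [0.5, -1.0], "bias": 0.25}

payload = dumps_model(fitted)
restored = loads_model(payload)
```

Unpickling executes code embedded in the payload, so the decode side belongs only behind a store the application itself writes to.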

Page 35: Data Product Architectures

Model Data Storage

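from django.db import models


class PredictiveModel(models.Model):
    # CharField requires max_length; 255 is an illustrative value.
    name = models.CharField(max_length=255)
    # models.JSONField requires Django 3.1 or later.
    params = models.JSONField()
    build = models.FloatField()
    f1_score = models.FloatField()
    created = models.DateTimeField()
    # Serialized (pickled) model.
    data = models.BinaryField()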

Page 36: Data Product Architectures

REST API Model Interaction

featurize()

predict()

Models Stored in Memory

Update Annotations
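
A rough sketch of how these pieces might fit together, under stated assumptions: the `color`, `shape`, and `amount` fields come from the Instance model earlier in the deck, while `AmountThreshold`, the `MODELS` registry, and `annotate` are hypothetical names standing in for a stored model and the API handler.

```python
SHAPES = ("square", "triangle", "circle")


def featurize(instance):
    """Turn a raw instance dict into a numeric feature vector."""
    one_hot = [1.0 if instance["shape"] == s else 0.0 for s in SHAPES]
    return one_hot + [float(instance["amount"])]


class AmountThreshold:
    """Hypothetical stand-in for a fitted model loaded from storage."""

    def __init__(self, threshold):
        self.threshold = threshold

    def predict(self, features):
        return "large" if features[-1] > self.threshold else "small"


# Loaded once at startup and kept in memory, so a request pays only for
# featurize() and predict(), not for deserializing the model.
MODELS = {"default": AmountThreshold(threshold=10)}


def annotate(instance, model_name="default"):
    """Return a copy of the instance with a 'prediction' annotation added."""
    model = MODELS[model_name]
    return {**instance, "prediction": model.predict(featurize(instance))}


result = annotate({"color": "red", "shape": "circle", "amount": 42})
```

Swapping a rebuilt model into `MODELS` changes later predictions without touching the request path.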

Page 37: Data Product Architectures

Build Data Products!

Page 39: Data Product Architectures

Questions?