willem pienaar data science platform lead go-jek · 2018-10-02 · features develop model fix...

35
Building a feature platform to scale machine learning Willem Pienaar Data Science Platform Lead GO-JEK

Upload: others

Post on 03-Aug-2020

2 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Willem Pienaar Data Science Platform Lead GO-JEK · 2018-10-02 · features develop model fix pipeline develop service test deploy integrate evaluate Not Data Science Not Data Science

Building a feature platform to scale machine learningWillem Pienaar

Data Science Platform LeadGO-JEK

Page 2: Willem Pienaar Data Science Platform Lead GO-JEK · 2018-10-02 · features develop model fix pipeline develop service test deploy integrate evaluate Not Data Science Not Data Science
Page 3: Willem Pienaar Data Science Platform Lead GO-JEK · 2018-10-02 · features develop model fix pipeline develop service test deploy integrate evaluate Not Data Science Not Data Science

Our Growth

2011

Call-center for ojek* services

2015 2016 2017 2018

?

Page 4: Willem Pienaar Data Science Platform Lead GO-JEK · 2018-10-02 · features develop model fix pipeline develop service test deploy integrate evaluate Not Data Science Not Data Science

Our Home

Operating in 60 cities throughout Indonesia

80mapp downloads

+200kmerchants

60cities

1m+drivers

100m+monthly bookings

Manado

Medan

Batam

Bali

Makassar

Jakarta

Palembang Balikpapan

Page 5: Willem Pienaar Data Science Platform Lead GO-JEK · 2018-10-02 · features develop model fix pipeline develop service test deploy integrate evaluate Not Data Science Not Data Science

Our Data

Page 6: Willem Pienaar Data Science Platform Lead GO-JEK · 2018-10-02 · features develop model fix pipeline develop service test deploy integrate evaluate Not Data Science Not Data Science

Where do data scientists spend their time?

explore data

find datainstallsoftwareprovision

infra wait

automateinfra

make features

develop model

fix pipeline

develop service

test

deployintegrate

evaluate

Not Data Science Not Data Science

Page 7: Willem Pienaar Data Science Platform Lead GO-JEK · 2018-10-02 · features develop model fix pipeline develop service test deploy integrate evaluate Not Data Science Not Data Science

The problems with feature engineering

■ Pipeline jungles

■ Data processing did not scale

■ Real-time features required engineers

■ Inconsistency between training and serving

■ Lack of discovery

■ Lack of standardization

Page 8: Willem Pienaar Data Science Platform Lead GO-JEK · 2018-10-02 · features develop model fix pipeline develop service test deploy integrate evaluate Not Data Science Not Data Science

What should a feature platform allow us to do?

■ Standardize feature definitions

■ Provide a means for creating batch and streaming features

■ Allow us to create datasets to train our ML models

■ Allow model services to access features in production

■ Allow for the discovery of features

■ Abstract away engineering

Page 9: Willem Pienaar Data Science Platform Lead GO-JEK · 2018-10-02 · features develop model fix pipeline develop service test deploy integrate evaluate Not Data Science Not Data Science

On trip

High ETA

Heading to home area

Customer

Selected driver

Which driver should we send?

Page 10: Willem Pienaar Data Science Platform Lead GO-JEK · 2018-10-02 · features develop model fix pipeline develop service test deploy integrate evaluate Not Data Science Not Data Science

Which features do we need?

Input Features

■ Driver location, speed, direction,

and ETA to customer

■ Time of day, day of week

■ Demand, supply

■ Origin, destination

■ Customer profile, actions

■ And hundreds of other features

Page 11: Willem Pienaar Data Science Platform Lead GO-JEK · 2018-10-02 · features develop model fix pipeline develop service test deploy integrate evaluate Not Data Science Not Data Science

Which features types should we support?

Feature types

■ Entity (driver, customer, area)

■ Process (batch, real-time)

■ Granularity (day, hour, minute, second)

■ Value type (primitive, complex)

Page 12: Willem Pienaar Data Science Platform Lead GO-JEK · 2018-10-02 · features develop model fix pipeline develop service test deploy integrate evaluate Not Data Science Not Data Science

Feature specificationsname: booking_countowner: [email protected]: Total number of booking created per day per driveruri: https://yourdomain.com/organization/feature-transforms/driver-feature-pipelinegranularity: DAYvalueType: INT64entity: drivertags: - driver - booking - streamingdataStores: serving: id: driver_serving warehouse: id: driver_warehouse

Page 13: Willem Pienaar Data Science Platform Lead GO-JEK · 2018-10-02 · features develop model fix pipeline develop service test deploy integrate evaluate Not Data Science Not Data Science

Two types of feature creation processes

Data WarehouseBigQuery

Data LakeCloud Storage

EventsApache Kafka

Batch Features

Feature CreationData sources

Streaming Features

Feature Store

The two main types of feature creation workflows

1. Batch workflow

2. Stream workflow

Page 14: Willem Pienaar Data Science Platform Lead GO-JEK · 2018-10-02 · features develop model fix pipeline develop service test deploy integrate evaluate Not Data Science Not Data Science

Publishing batch features

SELECT driver_id, count(price_of_item) as booking_countFROM driver_bookingsWHERE _PARTITIONTIME >= dateadd(@execution_date, -1, day)AND _PARTITIONTIME < @execution_date

Workflow

1. Create parameterised SQL query

Page 15: Willem Pienaar Data Science Platform Lead GO-JEK · 2018-10-02 · features develop model fix pipeline develop service test deploy integrate evaluate Not Data Science Not Data Science

Publishing batch features

owner: [email protected]: 2018-07-01catchup: falseinterval: "0 0 * * *"entity: drivergranularity: DAYpath: stable/driver/day/bookings.sqlcolumns:- name: completed_bookings

featureId: driver.day.completed_bookings_v1- name: cancelled_bookings

featureId: driver.day.cancelled_bookings_v1

Workflow

1. Create parameterised SQL query2. Configure creation specification

Page 16: Willem Pienaar Data Science Platform Lead GO-JEK · 2018-10-02 · features develop model fix pipeline develop service test deploy integrate evaluate Not Data Science Not Data Science

Publishing batch features

Workflow

1. Create parameterised SQL query2. Configure creation specification3. Push to repo, trigger Clockwork

Feature Transforms (SQL)

Creation Spec (YAML)

Clockwork

Airflow DAG

ClockworkDeclarative Airflow workflows through YAML

Page 17: Willem Pienaar Data Science Platform Lead GO-JEK · 2018-10-02 · features develop model fix pipeline develop service test deploy integrate evaluate Not Data Science Not Data Science

How do we publish real-time features?

Data WarehouseBigQuery

Data LakeCloud Storage

EventsApache Kafka

Feature CreationData sources

Feature Store

Batch TransformsBigQuery SQL

Streaming Features

Page 18: Willem Pienaar Data Science Platform Lead GO-JEK · 2018-10-02 · features develop model fix pipeline develop service test deploy integrate evaluate Not Data Science Not Data Science

Publishing streaming features

Apache Beam

■ Unified batch and stream support

■ Consistent feature definition between serving and training

■ Automatic scaling with Dataflow

■ No lock-in

Data WarehouseBigQuery

Data LakeCloud Storage

EventsApache Kafka

Feature CreationData sources

Feature Store

Batch TransformsBigQuery SQL

Feature TransformsCloud Dataflow

Page 19: Willem Pienaar Data Science Platform Lead GO-JEK · 2018-10-02 · features develop model fix pipeline develop service test deploy integrate evaluate Not Data Science Not Data Science

Defining a feature in Beam

Beam running on Dataflow

Trip count per driver

trip_count_per_driver = (

trip_events

| 'filter' >> beam.Filter(lambda trip: trip['status'] == 'success')

| 'pair' >> beam.Map(lambda driver: (driver, 1))

| 'sum' >> beam.CombinePerKey(sum))

Trip events

Page 20: Willem Pienaar Data Science Platform Lead GO-JEK · 2018-10-02 · features develop model fix pipeline develop service test deploy integrate evaluate Not Data Science Not Data Science

How will our clients use the feature store?

Data WarehouseBigQuery

Data LakeCloud Storage

EventsApache Kafka

Feature CreationData sources

Feature Store

Batch TransformsBigQuery SQL

Feature TransformsCloud Dataflow

User

Training Pipelines

Model Serving

Page 21: Willem Pienaar Data Science Platform Lead GO-JEK · 2018-10-02 · features develop model fix pipeline develop service test deploy integrate evaluate Not Data Science Not Data Science

Training store requirements

■ Scalable big data processing■ Handle large and sparse key spaces■ Joins■ Easy to use

Page 22: Willem Pienaar Data Science Platform Lead GO-JEK · 2018-10-02 · features develop model fix pipeline develop service test deploy integrate evaluate Not Data Science Not Data Science

BigQuery for storing our training data

Training StoreBigQuery

Feature Storage■ No infrastructure

■ Easy to use (SQL)

■ Scales to massive amounts of data

■ Integrated with GCP services

■ Already contains our labelled data

Data WarehouseBigQuery

Data LakeCloud Storage

EventsApache Kafka

Feature CreationData sources

Batch TransformsBigQuery SQL

Feature TransformsCloud Dataflow

Page 23: Willem Pienaar Data Science Platform Lead GO-JEK · 2018-10-02 · features develop model fix pipeline develop service test deploy integrate evaluate Not Data Science Not Data Science

Serving store requirements

■ Low latency reads (<10 ms)

■ High throughput reads/writes (150k+ rps)

■ Large volume of feature data (TBs)

■ Persistence

■ Linearly scalable

Page 24: Willem Pienaar Data Science Platform Lead GO-JEK · 2018-10-02 · features develop model fix pipeline develop service test deploy integrate evaluate Not Data Science Not Data Science

Bigtable as primary store for serving features

Training StoreBigQuery

Serving StoreCloud Bigtable

Feature StorageBigtable■ 10k RPS read/write per node■ 10 ms latency read/write■ No infrastructure■ Scalable■ Large capacity per node (TB+)■ Persistent

Data WarehouseBigQuery

Data LakeCloud Storage

EventsApache Kafka

Feature CreationData sources

Batch TransformsBigQuery SQL

Feature TransformsCloud Dataflow

Page 25: Willem Pienaar Data Science Platform Lead GO-JEK · 2018-10-02 · features develop model fix pipeline develop service test deploy integrate evaluate Not Data Science Not Data Science

Redis as secondary store for serving features

Training StoreBigQuery

Serving StoreCloud Bigtable

Serving StoreCloud Memorystore

Feature StorageRedis■ Extremely high throughput■ Very low latency read/write

Data WarehouseBigQuery

Data LakeCloud Storage

EventsApache Kafka

Feature CreationData sources

Batch TransformsBigQuery SQL

Feature TransformsCloud Dataflow

Page 26: Willem Pienaar Data Science Platform Lead GO-JEK · 2018-10-02 · features develop model fix pipeline develop service test deploy integrate evaluate Not Data Science Not Data Science

Build a feature serving API

Feature Serving API

Feature AccessServing API■ Load balancing■ Caching■ Fail-over■ QoS■ Authentication

Training StoreBigQuery

Serving StoreCloud Bigtable

Serving StoreCloud Memorystore

Feature Storage

Data WarehouseBigQuery

Data LakeCloud Storage

EventsApache Kafka

Feature CreationData sources

Batch TransformsBigQuery SQL

Feature TransformsCloud Dataflow

Page 27: Willem Pienaar Data Science Platform Lead GO-JEK · 2018-10-02 · features develop model fix pipeline develop service test deploy integrate evaluate Not Data Science Not Data Science

Extending the feature platformname: booking_summaryowner: [email protected]: DAYvalueType: BYTESentity: driverdataStores: serving: id: driver_serving warehouse: id: driver_warehouse

Page 28: Willem Pienaar Data Science Platform Lead GO-JEK · 2018-10-02 · features develop model fix pipeline develop service test deploy integrate evaluate Not Data Science Not Data Science

Extending the feature platformname: booking_summaryowner: [email protected]: DAYvalueType: BYTESentity: driveroptions: protoValueType: com.gojek.ds.driver.BookingSummary JSONInWarehouse: TruedataStores: serving: id: driver_serving warehouse: id: driver_warehouse

Will decode bytes into JSON before storing in BigQuery

Page 29: Willem Pienaar Data Science Platform Lead GO-JEK · 2018-10-02 · features develop model fix pipeline develop service test deploy integrate evaluate Not Data Science Not Data Science

Extending the feature platformname: booking_countowner: [email protected]: DAYvalueType: INT64entity: driveroptions: minimumDiscreteValue: 0 maximumDiscreteValue: 1000 warnOnFailure: TruedataStores: serving: id: driver_serving warehouse: id: driver_warehouse

Logs a warning if the feature value is outside of a bound

Page 30: Willem Pienaar Data Science Platform Lead GO-JEK · 2018-10-02 · features develop model fix pipeline develop service test deploy integrate evaluate Not Data Science Not Data Science

Feature explorer

Page 31: Willem Pienaar Data Science Platform Lead GO-JEK · 2018-10-02 · features develop model fix pipeline develop service test deploy integrate evaluate Not Data Science Not Data Science

Feature explorer

What is it used for?

■ Feature discovery

■ Feature management

■ Performance statistics

■ Usage statistics

■ Traceability

■ Job execution

■ System health

Page 32: Willem Pienaar Data Science Platform Lead GO-JEK · 2018-10-02 · features develop model fix pipeline develop service test deploy integrate evaluate Not Data Science Not Data Science

Data WarehouseBigQuery

Data LakeCloud Storage

EventsApache Kafka

Feature CreationData sources

Batch TransformsBigQuery SQL

Feature TransformsCloud Dataflow

Feature platform

Feature Serving API

Feature Explorer

Data Scientists

Training StoreBigQuery

Serving StoreCloud Bigtable

Serving StoreCloud Memorystore

Feature Storage Feature Access

Feature lookup

Page 33: Willem Pienaar Data Science Platform Lead GO-JEK · 2018-10-02 · features develop model fix pipeline develop service test deploy integrate evaluate Not Data Science Not Data Science

How are they spending their time now?

explore data

create features

develop model

evaluate

Page 34: Willem Pienaar Data Science Platform Lead GO-JEK · 2018-10-02 · features develop model fix pipeline develop service test deploy integrate evaluate Not Data Science Not Data Science

Impact on GO-JEK

Improved customer experience

Less data scientists per customer

Less infrastructure

Faster time to market for ML projects

Page 35: Willem Pienaar Data Science Platform Lead GO-JEK · 2018-10-02 · features develop model fix pipeline develop service test deploy integrate evaluate Not Data Science Not Data Science

Thank [email protected]