
Deep Learning at Twitter's Scale

Cibele Montez Halasz, Machine Learning Engineer @ Twitter Cortex

October 11, 2018

Agenda

● Background
● Workflow/Platform
● Modeling and Optimizations
● Additional Performance Gains
● GPU

Background

● Challenges: Characteristics of Platform

● Data Shift


Sparsity of Data


Challenges

Source: Luca Belli/Dan Shiebler, June, 2018

VERY SPARSE DATA


Speed


Challenges

Source: Jacob Kastrenakes, The Verge, July, 2018


Data Shift


Data Shift

Source: Luca Belli/Dan Shiebler, June, 2018



Machine Learning at the Company

● Environment
● Modeling: some use cases

Environment

Environment: ML Overview

Source: Luca Belli/Dan Shiebler, June, 2018

[Diagram: Team A’s Data Aggregation and Feature Extraction Job → Cortex Embedding Generation Pipeline → Feature Registry → Team A’s, Team B’s, and Team C’s Machine Learning Models]

Environment

Environment: ML Workflows

Source: Devin Goodsell, NYC Machine Learning Meetup, June 20, 2018

Environment

Environment: ML Training

Environment

Environment: Priorities

● Feature Addition → Scalable data
● Data Addition → Scalable data
● Training → Fast, robust training engine
● Deployment → Seamless and tested ML services
● A/B test → Good A/B test environment

Modeling: Modeling and Optimizations with TensorFlow

Modeling

Discretizer → Full Sparse → Dense MLP → Full Sparse
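As a rough illustration of the first stage: a discretizer maps skewed continuous features to bucket ids that the sparse layers can consume. A minimal sketch, assuming equal-frequency (percentile) bucketing; the data and bucket count are invented for illustration:

```python
import numpy as np

# Illustrative data: a heavily skewed continuous feature (e.g. a count).
rng = np.random.default_rng(0)
train_column = rng.lognormal(size=100_000)

# Equal-frequency discretizer: 99 percentile cut points -> 100 buckets,
# so each bucket receives roughly the same number of training examples.
boundaries = np.percentile(train_column, np.arange(1, 100))

def discretize(value):
    # Bucket id in [0, 99] for a single continuous value.
    return int(np.digitize(value, boundaries))

print(discretize(0.1), discretize(1.0), discretize(50.0))
```

Equal-frequency buckets keep every bucket populated even when the raw distribution is heavily skewed, which matches the sparsity challenges described earlier.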

Modeling

Sparse Linear Layer

[Diagram: sparse input with feature indices 1, 2, …, K and corresponding values v1, v2, …, vn-1, vn]

Modeling

Sparse Linear Layer: First Approach

Source: Tensorflow
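The code on this slide was a screenshot and is not in the transcript. A minimal sketch of what the first approach plausibly looked like, using tf.sparse_tensor_dense_matmul, which the benchmark table at the end of the deck names as the baseline (TensorFlow 1.x; sizes are illustrative):

```python
import tensorflow as tf

NUM_FEATURES = 1 << 20   # illustrative size of the (hashed) feature space
HIDDEN = 200

# The batch arrives as one SparseTensor over the full feature space.
x = tf.sparse_placeholder(tf.float32, shape=[None, NUM_FEATURES])
w = tf.get_variable("w", shape=[NUM_FEATURES, HIDDEN])
b = tf.get_variable("b", shape=[HIDDEN], initializer=tf.zeros_initializer())

# Sparse-dense matmul: correct, but the whole dense weight matrix participates
# in every step, even though only a tiny fraction of features are active.
y = tf.sparse_tensor_dense_matmul(x, w) + b
```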

Modeling

Sparse Linear Layer: Final Approach

Source: Tensorflow
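The final approach's code is likewise missing from the transcript. A plausible reconstruction, not confirmed by the deck, is a gather-based layer that reads and updates only the weight rows whose feature ids actually appear in the batch, e.g. via tf.nn.embedding_lookup_sparse:

```python
import tensorflow as tf

NUM_FEATURES = 1 << 20   # illustrative
HIDDEN = 200

ids = tf.sparse_placeholder(tf.int64)     # active feature ids per example
vals = tf.sparse_placeholder(tf.float32)  # matching values, same sparsity pattern
w = tf.get_variable("w", shape=[NUM_FEATURES, HIDDEN])
b = tf.get_variable("b", shape=[HIDDEN], initializer=tf.zeros_initializer())

# Gathers only the rows for ids present in the batch, then combines them
# per example as a weighted sum; gradients arrive as sparse IndexedSlices.
y = tf.nn.embedding_lookup_sparse(w, ids, vals, combiner="sum") + b
```

Because the gradient only covers the touched rows, this pairs naturally with the sparse-aware optimizers on the next slide.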

Modeling

Sparse Linear Layer

● Optimizers

○ SGD
○ Lazy Adam

Source: Berkeley Artificial Intelligence Research (BAIR) Lab
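Vanilla Adam updates its first- and second-moment accumulators for every row on every step, which defeats the sparsity; Lazy Adam touches only the rows present in the gradient. A minimal usage sketch from TF 1.x's tf.contrib.opt (the toy loss is invented):

```python
import tensorflow as tf

# Toy sparse model: only rows 3 and 17 of w receive gradients this step.
w = tf.get_variable("w", shape=[1000, 8])
loss = tf.reduce_sum(tf.nn.embedding_lookup(w, [3, 17]) ** 2)

# Lazy Adam: moment accumulators are updated only for rows that appear in
# the sparse gradient, instead of for all rows as with vanilla Adam.
opt = tf.contrib.opt.LazyAdamOptimizer(learning_rate=1e-3)
train_op = opt.minimize(loss)
```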

Modeling

Sparse Linear Layer: Variable Partitioning

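The partitioning code and its output on these slides were screenshots and did not survive extraction. A minimal sketch, assuming TensorFlow 1.x's built-in variable partitioning (tf.fixed_size_partitioner); the shard count is illustrative:

```python
import tensorflow as tf

NUM_FEATURES = 1 << 20   # illustrative
HIDDEN = 200

# Shard the big weight matrix into 4 pieces along its first dimension.
partitioner = tf.fixed_size_partitioner(num_shards=4)
with tf.variable_scope("sparse_layer", partitioner=partitioner):
    w = tf.get_variable("w", shape=[NUM_FEATURES, HIDDEN])

# w is a PartitionedVariable; each lookup routes an id to the shard owning it.
ids = tf.sparse_placeholder(tf.int64)
vals = tf.sparse_placeholder(tf.float32)
y = tf.nn.embedding_lookup_sparse(w, ids, vals, combiner="sum")
```

Sharding spreads lookup and update traffic across parameter servers instead of bottlenecking on the single machine that holds the full matrix.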


Modeling

Sparse Linear Layer: Variable Partitioning: Profiling

~33% reduction
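The profiling charts behind this figure were images. For reference, a Chrome-format timeline trace is one standard way to produce such before/after comparisons in TensorFlow 1.x; this sketch is generic tooling, not necessarily Twitter's:

```python
import tensorflow as tf
from tensorflow.python.client import timeline

x = tf.random_normal([1024, 1024])
y = tf.matmul(x, x)  # stand-in for a training step

with tf.Session() as sess:
    opts = tf.RunOptions(trace_level=tf.RunOptions.FULL_TRACE)
    meta = tf.RunMetadata()
    sess.run(y, options=opts, run_metadata=meta)

    # Write a trace viewable at chrome://tracing to compare op timings.
    trace = timeline.Timeline(meta.step_stats).generate_chrome_trace_format()
    with open("timeline.json", "w") as f:
        f.write(trace)
```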

Modeling

Sparse Linear Layer: Online Normalization

● Example:

Source: Nicolas Koumchatzky

Input: input_feature (value == 1M)
⇒ weight_gradient == 1M
⇒ update = 1M * learning_rate
⇒ a single extreme input value produces an enormous weight update, destabilizing training


Modeling

Sparse Linear Layer: Online Normalization

● Normalization of input values

Source: Nicolas Koumchatzky

○ Normalized values lie in [-1, 1]
○ Trainable per-feature bias: discriminates between the absence and presence of a feature
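A minimal sketch of what online normalization with a per-feature bias might look like; the running-statistics update, the decay constant, and the clipping into [-1, 1] are illustrative assumptions, not the exact method from the talk:

```python
import tensorflow as tf

NUM_FEATURES = 1000  # illustrative

x = tf.placeholder(tf.float32, [None, NUM_FEATURES])

# Running per-feature statistics, updated online as batches stream in.
mean = tf.get_variable("mean", [NUM_FEATURES], trainable=False,
                       initializer=tf.zeros_initializer())
var = tf.get_variable("var", [NUM_FEATURES], trainable=False,
                      initializer=tf.ones_initializer())
decay = 0.999
batch_mean, batch_var = tf.nn.moments(x, axes=[0])
update_mean = tf.assign(mean, decay * mean + (1 - decay) * batch_mean)
update_var = tf.assign(var, decay * var + (1 - decay) * batch_var)

with tf.control_dependencies([update_mean, update_var]):
    normalized = (x - mean) / tf.sqrt(var + 1e-6)
    normalized = tf.clip_by_value(normalized, -1.0, 1.0)  # keep values in [-1, 1]

# Trainable per-feature bias: lets the model tell "present with value 0"
# apart from "absent" (absent features never reach this layer at all).
bias = tf.get_variable("per_feature_bias", [NUM_FEATURES],
                       initializer=tf.zeros_initializer())
out = normalized + bias
```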

Additional Performance Gains

Performance Gains

Hogwild

Source: Hogwild! (Niu, Recht, Ré & Wright, University of Wisconsin-Madison)

Performance Gains

Hogwild

Source: Tensorflow
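Hogwild!-style training runs several workers that apply sparse gradient updates to shared parameters without locking; because each step touches only a few rows, collisions are rare and the occasional lost update is harmless. A minimal in-process sketch with threads (the toy model and counts are invented; real deployments shard across parameter servers):

```python
import random
import threading
import tensorflow as tf

w = tf.get_variable("w", shape=[1000, 8])
ids = tf.placeholder(tf.int64, [None])
loss = tf.reduce_sum(tf.nn.embedding_lookup(w, ids) ** 2)

# use_locking=False -> workers apply updates lock-free, Hogwild!-style.
opt = tf.train.GradientDescentOptimizer(0.01, use_locking=False)
train_op = opt.minimize(loss)

sess = tf.Session()
sess.run(tf.global_variables_initializer())

def worker(seed):
    rng = random.Random(seed)
    for _ in range(100):
        rows = [rng.randrange(1000) for _ in range(16)]  # few rows per step
        sess.run(train_op, feed_dict={ids: rows})

threads = [threading.Thread(target=worker, args=(i,)) for i in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```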

Performance Gains

Custom Ops

GPU vs. CPU Metrics

GPU Benchmarks

Batch Size | CPU: Baseline*   | CPU: After Optimizations | GPU: Baseline*   | GPU: After Optimizations
256        | 1024 samples/s   | 7372 samples/s           | 5504 samples/s   | 22528 samples/s
512        | 1638 samples/s   | 11264 samples/s          | 8448 samples/s   | 21504 samples/s
1024       | 2355 samples/s   | 13312 samples/s          | 10752 samples/s  | 22528 samples/s

*Baseline uses tf.sparse_tensor_dense_matmul

GPU benchmarks were run on NVIDIA Tesla K80 GPUs.
CPU benchmarks were run on Intel Xeon Platinum 8180 processors.

Acknowledgements

● Andrew Bean
● Ricardo Cervera-Navarro
● Priyank Jain
● Ruhua Jiang
● Nicholas Léonard
● Briac Marcatté
● Mahak Patidar
● Tim Sweeney
● Pavan Yalamanchili
● Yi Zhuang

Thank you! Questions?

Follow me on Twitter: @cibelemh
Email me at: cmontezhalasz@twitter.com
