infrastructure@slack bing wei - qcon san francisco · infrastructure@slack. 2. 3. our mission: to...

77
Scaling Slack Bing Wei Infrastructure@Slack

Upload: others

Post on 21-May-2020

2 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Infrastructure@Slack Bing Wei - QCon San Francisco · Infrastructure@Slack. 2. 3. Our Mission: To make people’s working lives simpler, more pleasant, and more productive. 4. From

Scaling Slack

Bing WeiInfrastructure@Slack

Page 2: Infrastructure@Slack Bing Wei - QCon San Francisco · Infrastructure@Slack. 2. 3. Our Mission: To make people’s working lives simpler, more pleasant, and more productive. 4. From

2

Page 3: Infrastructure@Slack Bing Wei - QCon San Francisco · Infrastructure@Slack. 2. 3. Our Mission: To make people’s working lives simpler, more pleasant, and more productive. 4. From

3

Page 4: Infrastructure@Slack Bing Wei - QCon San Francisco · Infrastructure@Slack. 2. 3. Our Mission: To make people’s working lives simpler, more pleasant, and more productive. 4. From

Our Mission:To make people’s workinglives simpler, more pleasant,and more productive.

4

Page 5: Infrastructure@Slack Bing Wei - QCon San Francisco · Infrastructure@Slack. 2. 3. Our Mission: To make people’s working lives simpler, more pleasant, and more productive. 4. From

From supporting small teams

To serving gigantic organizations of hundreds of thousands of users

5

Page 6: Infrastructure@Slack Bing Wei - QCon San Francisco · Infrastructure@Slack. 2. 3. Our Mission: To make people’s working lives simpler, more pleasant, and more productive. 4. From

Slack Scale

◈ 6M+ DAU, 9M+ WAU5M+ peak simultaneously connected

◈ Avg 10+ hrs/weekday connectedAvg 2+ hrs/weekday in active use

◈ 55% of DAU outside of US

6

Page 7: Infrastructure@Slack Bing Wei - QCon San Francisco · Infrastructure@Slack. 2. 3. Our Mission: To make people’s working lives simpler, more pleasant, and more productive. 4. From

Cartoon Architecture

WebAppPHP/Hack

ShardedMySql

Messaging Server Java

Job QueueRedis/Kafka

HTTP

WebSocket

7

Page 8: Infrastructure@Slack Bing Wei - QCon San Francisco · Infrastructure@Slack. 2. 3. Our Mission: To make people’s working lives simpler, more pleasant, and more productive. 4. From

Outline

◈ The slowness problem◈ Incremental Improvements ◈ Architecture changes

○ Flannel○ Client Pub/Sub○ Messaging Server

8

Page 9: Infrastructure@Slack Bing Wei - QCon San Francisco · Infrastructure@Slack. 2. 3. Our Mission: To make people’s working lives simpler, more pleasant, and more productive. 4. From

Challenge: Slowness Connecting to Slack

9

Page 10: Infrastructure@Slack Bing Wei - QCon San Francisco · Infrastructure@Slack. 2. 3. Our Mission: To make people’s working lives simpler, more pleasant, and more productive. 4. From

Login Flow in 2015

User1. HTTP POST with user’s token

2. HTTP Response: a snapshot of the team & websocket URL

WebApp MySql

10

Page 11: Infrastructure@Slack Bing Wei - QCon San Francisco · Infrastructure@Slack. 2. 3. Our Mission: To make people’s working lives simpler, more pleasant, and more productive. 4. From

Some examples

number of users/channels

response size

30 / 10 200K

500 / 200 2.5M

3K / 7K 20M

30K / 1K 60M

11

Page 12: Infrastructure@Slack Bing Wei - QCon San Francisco · Infrastructure@Slack. 2. 3. Our Mission: To make people’s working lives simpler, more pleasant, and more productive. 4. From

Login flow in 2015

User1. HTTP POST with user’s token

2. HTTP Response: a snapshot of the team & websocket url

WebApp

Messaging Server3. Websocket: real-time events

MySql

12

Page 13: Infrastructure@Slack Bing Wei - QCon San Francisco · Infrastructure@Slack. 2. 3. Our Mission: To make people’s working lives simpler, more pleasant, and more productive. 4. From

Real-time Events on WebSocket

UserMessaging Server

WebSocket: 100+ types of events

e.g. chat messages,typing indicator,files uploads, files comments,threads replies,user presence changes,user profile changes, reactions, pins, stars,channel creations,app installations,etc.

13

Page 14: Infrastructure@Slack Bing Wei - QCon San Francisco · Infrastructure@Slack. 2. 3. Our Mission: To make people’s working lives simpler, more pleasant, and more productive. 4. From

Login Flow in 2015

◈ Clients Architecture○ Download a snapshot of entire team○ Updates trickle in through the WebSocket○ Eventually consistent snapshot of whole team

14

Page 15: Infrastructure@Slack Bing Wei - QCon San Francisco · Infrastructure@Slack. 2. 3. Our Mission: To make people’s working lives simpler, more pleasant, and more productive. 4. From

Problems

Initial team snapshot takes time

15

Page 16: Infrastructure@Slack Bing Wei - QCon San Francisco · Infrastructure@Slack. 2. 3. Our Mission: To make people’s working lives simpler, more pleasant, and more productive. 4. From

Problems

Initial team snapshot takes timeLarge client memory footprint

16

Page 17: Infrastructure@Slack Bing Wei - QCon San Francisco · Infrastructure@Slack. 2. 3. Our Mission: To make people’s working lives simpler, more pleasant, and more productive. 4. From

Problems

Initial team snapshot takes timeLarge client memory footprint

Expensive reconnections

17

Page 18: Infrastructure@Slack Bing Wei - QCon San Francisco · Infrastructure@Slack. 2. 3. Our Mission: To make people’s working lives simpler, more pleasant, and more productive. 4. From

Problems

Initial team snapshot takes timeLarge client memory footprint

Expensive reconnectionsReconnect storm

18

Page 19: Infrastructure@Slack Bing Wei - QCon San Francisco · Infrastructure@Slack. 2. 3. Our Mission: To make people’s working lives simpler, more pleasant, and more productive. 4. From

Outline

◈ The slowness problem◈ Incremental Improvements ◈ Architecture changes

○ Flannel○ Client Pub/Sub○ Messaging Server

19

Page 20: Infrastructure@Slack Bing Wei - QCon San Francisco · Infrastructure@Slack. 2. 3. Our Mission: To make people’s working lives simpler, more pleasant, and more productive. 4. From

Improvements

◈ Smaller team snapshot○ Client local storage + delta○ Remove objects out + in parallel loading○ Simplified objects: e.g. canonical Avatars

20

Page 21: Infrastructure@Slack Bing Wei - QCon San Francisco · Infrastructure@Slack. 2. 3. Our Mission: To make people’s working lives simpler, more pleasant, and more productive. 4. From

Improvements

◈ Incremental boot○ Load one channel first

21

Page 22: Infrastructure@Slack Bing Wei - QCon San Francisco · Infrastructure@Slack. 2. 3. Our Mission: To make people’s working lives simpler, more pleasant, and more productive. 4. From

Improvements

◈ Rate Limit◈ POPs ◈ Load Testing Framework

22

Page 23: Infrastructure@Slack Bing Wei - QCon San Francisco · Infrastructure@Slack. 2. 3. Our Mission: To make people’s working lives simpler, more pleasant, and more productive. 4. From

Support New Product Features

Product Launch

23

Page 24: Infrastructure@Slack Bing Wei - QCon San Francisco · Infrastructure@Slack. 2. 3. Our Mission: To make people’s working lives simpler, more pleasant, and more productive. 4. From

Cope with New Product Features

Product Launch

24

Page 25: Infrastructure@Slack Bing Wei - QCon San Francisco · Infrastructure@Slack. 2. 3. Our Mission: To make people’s working lives simpler, more pleasant, and more productive. 4. From

Still...

Limitations ◈ What if team sizes keep growing◈ Outages when clients dump their local

storage

25

Page 26: Infrastructure@Slack Bing Wei - QCon San Francisco · Infrastructure@Slack. 2. 3. Our Mission: To make people’s working lives simpler, more pleasant, and more productive. 4. From

Outline

◈ The slowness problem◈ Incremental Improvements ◈ Architecture changes

○ Flannel○ Client Pub/Sub○ Messaging Server

26

Page 27: Infrastructure@Slack Bing Wei - QCon San Francisco · Infrastructure@Slack. 2. 3. Our Mission: To make people’s working lives simpler, more pleasant, and more productive. 4. From

Client Lazy Loading

Download less data upfront◈ Fetch more on demand

27

Page 28: Infrastructure@Slack Bing Wei - QCon San Francisco · Infrastructure@Slack. 2. 3. Our Mission: To make people’s working lives simpler, more pleasant, and more productive. 4. From

Flannel: Edge Cache Service

A query engine backed by cache on edge locations

28

Page 29: Infrastructure@Slack Bing Wei - QCon San Francisco · Infrastructure@Slack. 2. 3. Our Mission: To make people’s working lives simpler, more pleasant, and more productive. 4. From

What are in Flannel’s cache

◈ Support big objects first ○ Users○ Channels Membership○ Channels

29

Page 30: Infrastructure@Slack Bing Wei - QCon San Francisco · Infrastructure@Slack. 2. 3. Our Mission: To make people’s working lives simpler, more pleasant, and more productive. 4. From

Login and Message Flow with Flannel

User Messaging Server

WebApp MySQLFlannel

3. WebSocket: Stream Json events

1. WebSocket connection

2. HTTP Post:download a snapshotof the team

30

Page 31: Infrastructure@Slack Bing Wei - QCon San Francisco · Infrastructure@Slack. 2. 3. Our Mission: To make people’s working lives simpler, more pleasant, and more productive. 4. From

A Man in the Middle

User

Messaging Server Flannel

Use real-time events to update its cacheE.g. user creation,

user profile change,

channel creation,

user joins a channel,

channel convert to private

WebSocket WebSocket

31

Page 32: Infrastructure@Slack Bing Wei - QCon San Francisco · Infrastructure@Slack. 2. 3. Our Mission: To make people’s working lives simpler, more pleasant, and more productive. 4. From

Edge Locations

Mix of AWS & Google Cloud

main regionus-east-1

32

Page 33: Infrastructure@Slack Bing Wei - QCon San Francisco · Infrastructure@Slack. 2. 3. Our Mission: To make people’s working lives simpler, more pleasant, and more productive. 4. From

Examples Powered by Flannel

Quick Switcher

33

Page 34: Infrastructure@Slack Bing Wei - QCon San Francisco · Infrastructure@Slack. 2. 3. Our Mission: To make people’s working lives simpler, more pleasant, and more productive. 4. From

Examples Powered by Flannel

MentionSuggestion

34

Page 35: Infrastructure@Slack Bing Wei - QCon San Francisco · Infrastructure@Slack. 2. 3. Our Mission: To make people’s working lives simpler, more pleasant, and more productive. 4. From

Examples Powered by Flannel

ChannelHeader

35

Page 36: Infrastructure@Slack Bing Wei - QCon San Francisco · Infrastructure@Slack. 2. 3. Our Mission: To make people’s working lives simpler, more pleasant, and more productive. 4. From

Examples Powered by Flannel

ChannelSidebar

36

Page 37: Infrastructure@Slack Bing Wei - QCon San Francisco · Infrastructure@Slack. 2. 3. Our Mission: To make people’s working lives simpler, more pleasant, and more productive. 4. From

Examples Powered by Flannel

TeamDirectory

37

Page 38: Infrastructure@Slack Bing Wei - QCon San Francisco · Infrastructure@Slack. 2. 3. Our Mission: To make people’s working lives simpler, more pleasant, and more productive. 4. From

Flannel Results

◈ Launched Jan 2017○ Load 200K user team

◈ 5M+ simultaneous connections at peak

◈ 1M+ clients queries/sec

38

Page 39: Infrastructure@Slack Bing Wei - QCon San Francisco · Infrastructure@Slack. 2. 3. Our Mission: To make people’s working lives simpler, more pleasant, and more productive. 4. From

Flannel Results

39

Page 40: Infrastructure@Slack Bing Wei - QCon San Francisco · Infrastructure@Slack. 2. 3. Our Mission: To make people’s working lives simpler, more pleasant, and more productive. 4. From

This is not the end of the story

40

Page 41: Infrastructure@Slack Bing Wei - QCon San Francisco · Infrastructure@Slack. 2. 3. Our Mission: To make people’s working lives simpler, more pleasant, and more productive. 4. From

Evolution of Flannel

41

Page 42: Infrastructure@Slack Bing Wei - QCon San Francisco · Infrastructure@Slack. 2. 3. Our Mission: To make people’s working lives simpler, more pleasant, and more productive. 4. From

Web Client Iterations

Flannel Just-In-Time Annotation

Right before Web clients are about to access an object, Flannel pushes that object to clients.

42

Page 43: Infrastructure@Slack Bing Wei - QCon San Francisco · Infrastructure@Slack. 2. 3. Our Mission: To make people’s working lives simpler, more pleasant, and more productive. 4. From

A Closer Look

Why does Flannel sit on WebSocket?

43

Page 44: Infrastructure@Slack Bing Wei - QCon San Francisco · Infrastructure@Slack. 2. 3. Our Mission: To make people’s working lives simpler, more pleasant, and more productive. 4. From

Old Way of Cache Updates

Users

Messaging Server Flannel

LOTS of duplicated Json events

44

Page 45: Infrastructure@Slack Bing Wei - QCon San Francisco · Infrastructure@Slack. 2. 3. Our Mission: To make people’s working lives simpler, more pleasant, and more productive. 4. From

Publish/Subscribe (Pub/Sub) to Update Cache

Users

Messaging Server Flannel

Pub/Sub Thrift events

45

Page 46: Infrastructure@Slack Bing Wei - QCon San Francisco · Infrastructure@Slack. 2. 3. Our Mission: To make people’s working lives simpler, more pleasant, and more productive. 4. From

Pub/Sub Benefits

Less Flannel CPUSimpler Flannel code

Schema dataFlexibility for cache management

46

Page 47: Infrastructure@Slack Bing Wei - QCon San Francisco · Infrastructure@Slack. 2. 3. Our Mission: To make people’s working lives simpler, more pleasant, and more productive. 4. From

Flexibility for Cache Management

Previously◈ Load when the first user connects◈ Unload when the last user disconnects

47

Page 48: Infrastructure@Slack Bing Wei - QCon San Francisco · Infrastructure@Slack. 2. 3. Our Mission: To make people’s working lives simpler, more pleasant, and more productive. 4. From

Flexibility for Cache Management

With Pub/Sub◈ Isolate received events from user

connections

48

Page 49: Infrastructure@Slack Bing Wei - QCon San Francisco · Infrastructure@Slack. 2. 3. Our Mission: To make people’s working lives simpler, more pleasant, and more productive. 4. From

Another Closer Look

◈ With Pub/Sub, does Flannel need to be on WebSocket path?

49

Page 50: Infrastructure@Slack Bing Wei - QCon San Francisco · Infrastructure@Slack. 2. 3. Our Mission: To make people’s working lives simpler, more pleasant, and more productive. 4. From

Next Step

Move Flannel out of WebSocket path

50

Page 51: Infrastructure@Slack Bing Wei - QCon San Francisco · Infrastructure@Slack. 2. 3. Our Mission: To make people’s working lives simpler, more pleasant, and more productive. 4. From

Next Step

Move Flannel out of WebSocket path

Why?Separation & Flexibility

51

Page 52: Infrastructure@Slack Bing Wei - QCon San Francisco · Infrastructure@Slack. 2. 3. Our Mission: To make people’s working lives simpler, more pleasant, and more productive. 4. From

Evolution with Product Requirements

Grid for Big Enterprise

52

Page 53: Infrastructure@Slack Bing Wei - QCon San Francisco · Infrastructure@Slack. 2. 3. Our Mission: To make people’s working lives simpler, more pleasant, and more productive. 4. From

Team Affinity for Cache Efficiency

Before Grid

53

Page 54: Infrastructure@Slack Bing Wei - QCon San Francisco · Infrastructure@Slack. 2. 3. Our Mission: To make people’s working lives simpler, more pleasant, and more productive. 4. From

Team Affinity

Grid Aware

Now

54

Page 55: Infrastructure@Slack Bing Wei - QCon San Francisco · Infrastructure@Slack. 2. 3. Our Mission: To make people’s working lives simpler, more pleasant, and more productive. 4. From

Grid Awareness Improvements

Flannel Memory

Saves 22G of per host, 1.1TB total 55

Page 56: Infrastructure@Slack Bing Wei - QCon San Francisco · Infrastructure@Slack. 2. 3. Our Mission: To make people’s working lives simpler, more pleasant, and more productive. 4. From

Grid Awareness Improvements

DB Shard CPU Idle 25% -> 90% P99 User Connect Latency 40s -> 4s

For our biggest customer

56

Page 57: Infrastructure@Slack Bing Wei - QCon San Francisco · Infrastructure@Slack. 2. 3. Our Mission: To make people’s working lives simpler, more pleasant, and more productive. 4. From

Team Affinity

Grid Aware

Scatter & Gather

Future

57

Page 58: Infrastructure@Slack Bing Wei - QCon San Francisco · Infrastructure@Slack. 2. 3. Our Mission: To make people’s working lives simpler, more pleasant, and more productive. 4. From

Outline

◈ The slowness problem◈ Incremental Improvements ◈ Architecture changes

○ Flannel○ Client Pub/Sub○ Messaging Server

58

Page 59: Infrastructure@Slack Bing Wei - QCon San Francisco · Infrastructure@Slack. 2. 3. Our Mission: To make people’s working lives simpler, more pleasant, and more productive. 4. From

Expand Pub/Sub to Client Side

Client Side Pub/Subreduces events Clients have to handle

59

Page 60: Infrastructure@Slack Bing Wei - QCon San Francisco · Infrastructure@Slack. 2. 3. Our Mission: To make people’s working lives simpler, more pleasant, and more productive. 4. From

Presence Events

60% of all events

O(N2)1000 user team ⇒1000 * 1000 = 1M events

60

Page 61: Infrastructure@Slack Bing Wei - QCon San Francisco · Infrastructure@Slack. 2. 3. Our Mission: To make people’s working lives simpler, more pleasant, and more productive. 4. From

Presence Pub/Sub

Clients◈ Track who are in the

current view◈ Subscribe/Unsubscribe to

Messaging server when view changes

61

Page 62: Infrastructure@Slack Bing Wei - QCon San Francisco · Infrastructure@Slack. 2. 3. Our Mission: To make people’s working lives simpler, more pleasant, and more productive. 4. From

Outline

◈ The slowness problem◈ Incremental Improvements ◈ Architecture changes

○ Flannel○ Client Pub/Sub○ Messaging Server

62

Page 63: Infrastructure@Slack Bing Wei - QCon San Francisco · Infrastructure@Slack. 2. 3. Our Mission: To make people’s working lives simpler, more pleasant, and more productive. 4. From

What is Messaging Server

Messaging Server

63

Page 64: Infrastructure@Slack Bing Wei - QCon San Francisco · Infrastructure@Slack. 2. 3. Our Mission: To make people’s working lives simpler, more pleasant, and more productive. 4. From

A Message Router

Messaging Server

64

Page 65: Infrastructure@Slack Bing Wei - QCon San Francisco · Infrastructure@Slack. 2. 3. Our Mission: To make people’s working lives simpler, more pleasant, and more productive. 4. From

Events Routing and Fanout

Messaging ServerWebApp/

DB

1.Events happen on team

2.Events Fanout

65

Page 66: Infrastructure@Slack Bing Wei - QCon San Francisco · Infrastructure@Slack. 2. 3. Our Mission: To make people’s working lives simpler, more pleasant, and more productive. 4. From

Limitations

◈ Sharded by TeamSingle point of failure

66

Page 67: Infrastructure@Slack Bing Wei - QCon San Francisco · Infrastructure@Slack. 2. 3. Our Mission: To make people’s working lives simpler, more pleasant, and more productive. 4. From

Limitations

◈ Sharded by TeamSingle point of failure

◈ Shared ChannelsShared states among teams

67

Page 68: Infrastructure@Slack Bing Wei - QCon San Francisco · Infrastructure@Slack. 2. 3. Our Mission: To make people’s working lives simpler, more pleasant, and more productive. 4. From

68

Page 69: Infrastructure@Slack Bing Wei - QCon San Francisco · Infrastructure@Slack. 2. 3. Our Mission: To make people’s working lives simpler, more pleasant, and more productive. 4. From

Topic Sharding

Everything is a Topicpublic/private channel, DM, group DM,

user, team, grid

69

Page 70: Infrastructure@Slack Bing Wei - QCon San Francisco · Infrastructure@Slack. 2. 3. Our Mission: To make people’s working lives simpler, more pleasant, and more productive. 4. From

Topic Sharding

Natural fit for shared channels

70

Page 71: Infrastructure@Slack Bing Wei - QCon San Francisco · Infrastructure@Slack. 2. 3. Our Mission: To make people’s working lives simpler, more pleasant, and more productive. 4. From

Topic Sharding

Natural fit for shared channelsReduce user perceived failures

71

Page 72: Infrastructure@Slack Bing Wei - QCon San Francisco · Infrastructure@Slack. 2. 3. Our Mission: To make people’s working lives simpler, more pleasant, and more productive. 4. From

Other Improvements

Auto failure recovery

72

Page 73: Infrastructure@Slack Bing Wei - QCon San Francisco · Infrastructure@Slack. 2. 3. Our Mission: To make people’s working lives simpler, more pleasant, and more productive. 4. From

Other Improvements

Auto failure recoveryPublish/Subscribe

73

Page 74: Infrastructure@Slack Bing Wei - QCon San Francisco · Infrastructure@Slack. 2. 3. Our Mission: To make people’s working lives simpler, more pleasant, and more productive. 4. From

Other Improvements

Auto failure recoveryPublish/Subscribe Fanout at the edge

74

Page 75: Infrastructure@Slack Bing Wei - QCon San Francisco · Infrastructure@Slack. 2. 3. Our Mission: To make people’s working lives simpler, more pleasant, and more productive. 4. From

Our Journey

ProblemIncremental

ChangeArchitectural

ChangeOngoing Evolution

75

Page 76: Infrastructure@Slack Bing Wei - QCon San Francisco · Infrastructure@Slack. 2. 3. Our Mission: To make people’s working lives simpler, more pleasant, and more productive. 4. From

More To ComeJourney Ahead

Get in touch:https://slack.com/jobs

76

Page 77: Infrastructure@Slack Bing Wei - QCon San Francisco · Infrastructure@Slack. 2. 3. Our Mission: To make people’s working lives simpler, more pleasant, and more productive. 4. From

Thanks!

Any questions?@bingw11

77