how slack works - qcon san francisco...a: it sort of is, but it also works well. q: i’m skeptical....

45
How Slack Works Keith Adams [email protected] @keithmadams facebook.com/kma

Upload: others

Post on 14-Jul-2020

3 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: How Slack Works - QCon San Francisco...A: It sort of is, but it also works well. Q: I’m skeptical. A: You’re in good company! Check out this blog post. But we should probably get

How Slack WorksKeith Adams

[email protected] @keithmadams facebook.com/kma

Page 2: How Slack Works - QCon San Francisco...A: It sort of is, but it also works well. Q: I’m skeptical. A: You’re in good company! Check out this blog post. But we should probably get

What is

Slack?

Page 3: How Slack Works - QCon San Francisco...A: It sort of is, but it also works well. Q: I’m skeptical. A: You’re in good company! Check out this blog post. But we should probably get

What is

Slack?Voice Calls! Platform! Something about Bots!!

Page 4: How Slack Works - QCon San Francisco...A: It sort of is, but it also works well. Q: I’m skeptical. A: You’re in good company! Check out this blog post. But we should probably get

PersistentGroup

Messaging

But first it was a

Service

Page 5: How Slack Works - QCon San Francisco...A: It sort of is, but it also works well. Q: I’m skeptical. A: You’re in good company! Check out this blog post. But we should probably get

In this talk

● How Slack works today➞ Application logic➞ Persistence➞ Real-time messaging➞ Deferring work for later

● Problems● What we’re doing about them

Page 6: How Slack Works - QCon San Francisco...A: It sort of is, but it also works well. Q: I’m skeptical. A: You’re in good company! Check out this blog post. But we should probably get

Also in this talk

● Flaws● Challenges● Mistakes● Dead-ends● Future directions

Page 7: How Slack Works - QCon San Francisco...A: It sort of is, but it also works well. Q: I’m skeptical. A: You’re in good company! Check out this blog post. But we should probably get

Slack Scale

● 4M DAU, 5.8M WAUPeak simultaneous connected: 2.5M

● > 2H / weekday for each active user> 10H / weekday connected

● Half of DAU outside US

Page 8: How Slack Works - QCon San Francisco...A: It sort of is, but it also works well. Q: I’m skeptical. A: You’re in good company! Check out this blog post. But we should probably get

Slack House Style

● Conservative technical tasteMost supporting technologies are >10 years old

● Willing to write a little codeChoose low coupling, fitness-to-purpose over DRY

● MinimalismChoose something we already operate over something new and tailor-madeShallow, transparent stack of abstractions

Page 9: How Slack Works - QCon San Francisco...A: It sort of is, but it also works well. Q: I’m skeptical. A: You’re in good company! Check out this blog post. But we should probably get

Cartoon Architecture of Slack

MySQL

Job Queue

Message Server

WebApp

Page 10: How Slack Works - QCon San Francisco...A: It sort of is, but it also works well. Q: I’m skeptical. A: You’re in good company! Check out this blog post. But we should probably get

Case Study: Login and Receive Messages

slack.com

POST /api/rtm.start?token=xoxo--&...

Page 11: How Slack Works - QCon San Francisco...A: It sort of is, but it also works well. Q: I’m skeptical. A: You’re in good company! Check out this blog post. But we should probably get

Slack’s webapp codebase

● PHP monolith of app logic<1MLoC

● Scaled-out LAMP stack appMemcache wrapped around sharded MySQL

● Recently migrated to HHVMPerformance, hacklang

Page 12: How Slack Works - QCon San Francisco...A: It sort of is, but it also works well. Q: I’m skeptical. A: You’re in good company! Check out this blog post. But we should probably get

World’s shortest PHP-at-Slack FAQ

● Q: I hear/believe/have experienced PHP to be terrible.A: It sort of is, but it also works well.

● Q: I’m skeptical.A: You’re in good company! Check out this blog post. But we should probably get on with the talk at hand ...

● Q: Sounds good.A: Right-o.

Page 13: How Slack Works - QCon San Francisco...A: It sort of is, but it also works well. Q: I’m skeptical. A: You’re in good company! Check out this blog post. But we should probably get

Login and Receive Messages: the “mains”

slack.com main0

main1

SELECT db_shard FROM teams WHERE domain = %domain

Page 14: How Slack Works - QCon San Francisco...A: It sort of is, but it also works well. Q: I’m skeptical. A: You’re in good company! Check out this blog post. But we should probably get

Login and Receive Messages: the shards

slack.commain0

main1

main0

main1

main0

main1

main0

main1

Shard123a

Shard123b

SELECT * FROM channelsWHERE team_id = 711 ...

Page 15: How Slack Works - QCon San Francisco...A: It sort of is, but it also works well. Q: I’m skeptical. A: You’re in good company! Check out this blog post. But we should probably get

MySQL Shards

● Source of truth for most customer dataTeams, users, channels, messages, comments, emoji, ...

● Replication across two DCsAvailable for 1-DC failure

● Sharded by teamFor performance, fault isolation, and scalability

Page 16: How Slack Works - QCon San Francisco...A: It sort of is, but it also works well. Q: I’m skeptical. A: You’re in good company! Check out this blog post. But we should probably get

Why MySQL?

● Many, many thousands of server-years of working● The relational model is a good discipline● Experience● Tooling

Not because of ACID, though

Page 17: How Slack Works - QCon San Francisco...A: It sort of is, but it also works well. Q: I’m skeptical. A: You’re in good company! Check out this blog post. But we should probably get

Master-Master Replication

www1 Shard123a

Shard123b

www17

Page 18: How Slack Works - QCon San Francisco...A: It sort of is, but it also works well. Q: I’m skeptical. A: You’re in good company! Check out this blog post. But we should probably get

MMR Complications

● Choosing A in CAP terms● Conflicts are possible

➞ Most resolved automatically➞ Some manually, by operator action(!)

● INSERT ON DUPLICATE KEY UPDATE … ● Partitioning by team saves us

➞ Team writes cannot overlap➞ Even teams use “left” head, odd teams use “right” head

Page 19: How Slack Works - QCon San Francisco...A: It sort of is, but it also works well. Q: I’m skeptical. A: You’re in good company! Check out this blog post. But we should probably get

Case Study: Login and Receive Messages

slack.com{ “ok”: true, “url”: “wss:\/\/ms9.slack-msgs.com\/websocket\/7I5yBpcvk”, …}

Page 20: How Slack Works - QCon San Francisco...A: It sort of is, but it also works well. Q: I’m skeptical. A: You’re in good company! Check out this blog post. But we should probably get

Rtm.start payload

● Rtm.start returns an image of the whole team● Architecture of clients

➞ Eventually consistent snapshot of whole team➞ Updates trickle in through the web socket

● Guarantees responsive clients● ...once connection is established

Page 21: How Slack Works - QCon San Francisco...A: It sort of is, but it also works well. Q: I’m skeptical. A: You’re in good company! Check out this blog post. But we should probably get

Cartoon Architecture of Slack

MySQL

Job Queue

Message Server

WebApp

Page 22: How Slack Works - QCon San Francisco...A: It sort of is, but it also works well. Q: I’m skeptical. A: You’re in good company! Check out this blog post. But we should probably get

Persist, broadcast messages

Message Delivery

Message Server

WebApp

Page 23: How Slack Works - QCon San Francisco...A: It sort of is, but it also works well. Q: I’m skeptical. A: You’re in good company! Check out this blog post. But we should probably get

Wrinkles in Message Server

● Race between rtm.start and connection to MS➞ Event log mechanism

● Glitches, delays, net partitions while persisting➞ In-memory queue of pending sends➞ Queue depth sensitive barometer of system health

● Most messages are presence

Page 24: How Slack Works - QCon San Francisco...A: It sort of is, but it also works well. Q: I’m skeptical. A: You’re in good company! Check out this blog post. But we should probably get

Link unfurling

Deferring Work

Search indexing

Exports/Imports

Job Queue (Redis)

WebApp

Job Workers

Page 25: How Slack Works - QCon San Francisco...A: It sort of is, but it also works well. Q: I’m skeptical. A: You’re in good company! Check out this blog post. But we should probably get

Putting it all together

mains

shards

Message Server

WebApp

Page 26: How Slack Works - QCon San Francisco...A: It sort of is, but it also works well. Q: I’m skeptical. A: You’re in good company! Check out this blog post. But we should probably get

Things missing from the cartoon

● Memcache wrapped around many DB accesses➞ Case-by-case➞ Manual

● Computed data service (CDS)➞ Provides ML models via Thrift interface

● Rate-limiting around critical services● Search!

➞ Solr➞ Team-partitioned➞ fed from job queue workers

Page 27: How Slack Works - QCon San Francisco...A: It sort of is, but it also works well. Q: I’m skeptical. A: You’re in good company! Check out this blog post. But we should probably get

Slack Today: The Good Parts

● Team-partitioning➞ Easy scaling to lots of teams➞ Isolates failures and perf problems➞ Makes customer complaints easy to field➞ Natural fit for a paid product

● Per-team Message Server➞ Low-latency broadcasts

Page 28: How Slack Works - QCon San Francisco...A: It sort of is, but it also works well. Q: I’m skeptical. A: You’re in good company! Check out this blog post. But we should probably get

Some Hard Cases

Page 29: How Slack Works - QCon San Francisco...A: It sort of is, but it also works well. Q: I’m skeptical. A: You’re in good company! Check out this blog post. But we should probably get

Hard scenarios

● Mains failures● Rtm.start on large teams● Mass reconnects

Page 30: How Slack Works - QCon San Francisco...A: It sort of is, but it also works well. Q: I’m skeptical. A: You’re in good company! Check out this blog post. But we should probably get

Mains failure

● 1 master fails, partner takes over

● If both fail?➞ Many users can proceed via memcache➞ For the rest Slack is down➞ Quite possible if failure was load-induced

Page 31: How Slack Works - QCon San Francisco...A: It sort of is, but it also works well. Q: I’m skeptical. A: You’re in good company! Check out this blog post. But we should probably get

Rtm.start for large teams

● Returns image of entire team

● Channel membership is O(n2) for n users

Page 32: How Slack Works - QCon San Francisco...A: It sort of is, but it also works well. Q: I’m skeptical. A: You’re in good company! Check out this blog post. But we should probably get

Mass reconnects

● A large team loses, then regains, office Internet connectivity

● n users perform O(n2) rtm.start operations

● Can ‘melt’ the team shard

Page 33: How Slack Works - QCon San Francisco...A: It sort of is, but it also works well. Q: I’m skeptical. A: You’re in good company! Check out this blog post. But we should probably get

What are we going to

Doabout it?

Page 34: How Slack Works - QCon San Francisco...A: It sort of is, but it also works well. Q: I’m skeptical. A: You’re in good company! Check out this blog post. But we should probably get

Scale-out mains

● Replace mains spof● With what? We’re not sure yet● Kicking the tires carefully on a scary change

Page 35: How Slack Works - QCon San Francisco...A: It sort of is, but it also works well. Q: I’m skeptical. A: You’re in good company! Check out this blog post. But we should probably get

Rtm.start for large teams

● Incremental work➞ Current p95,p99: 221ms, 660ms

● Core problem: channel membership is O(n2)● Change APIs so clients can load channel members lazily● Much harder than it sounds!

Page 36: How Slack Works - QCon San Francisco...A: It sort of is, but it also works well. Q: I’m skeptical. A: You’re in good company! Check out this blog post. But we should probably get

Mass reconnects

● Introducing flannel

● Application-level edge cache

Page 37: How Slack Works - QCon San Francisco...A: It sort of is, but it also works well. Q: I’m skeptical. A: You’re in good company! Check out this blog post. But we should probably get

Pre-Flannel

Message Delivery

Message Server

WebApp

Page 38: How Slack Works - QCon San Francisco...A: It sort of is, but it also works well. Q: I’m skeptical. A: You’re in good company! Check out this blog post. But we should probably get

Message Server

Page 39: How Slack Works - QCon San Francisco...A: It sort of is, but it also works well. Q: I’m skeptical. A: You’re in good company! Check out this blog post. But we should probably get

Flannel status

● On for a few teams

● Rolling out to you soon with any luck

Page 40: How Slack Works - QCon San Francisco...A: It sort of is, but it also works well. Q: I’m skeptical. A: You’re in good company! Check out this blog post. But we should probably get

Phew

Page 41: How Slack Works - QCon San Francisco...A: It sort of is, but it also works well. Q: I’m skeptical. A: You’re in good company! Check out this blog post. But we should probably get

Stuff I had to leave out

● Lots of client tech!● Voice● Backups● Data warehouse● Search● Deploying code● Monitoring and alerting

Page 42: How Slack Works - QCon San Francisco...A: It sort of is, but it also works well. Q: I’m skeptical. A: You’re in good company! Check out this blog post. But we should probably get

Wrapping up

● Sketch of how Slack works➞ Application Logic➞ Persistence➞ Real-time messaging➞ Asynchronous Work

● Problems● What we’re doing about them

Page 43: How Slack Works - QCon San Francisco...A: It sort of is, but it also works well. Q: I’m skeptical. A: You’re in good company! Check out this blog post. But we should probably get

There is a lot left to doslack.com/jobs

Page 44: How Slack Works - QCon San Francisco...A: It sort of is, but it also works well. Q: I’m skeptical. A: You’re in good company! Check out this blog post. But we should probably get

...

Page 45: How Slack Works - QCon San Francisco...A: It sort of is, but it also works well. Q: I’m skeptical. A: You’re in good company! Check out this blog post. But we should probably get

Deployable Message Server

● Channel-sharded message bus

● Flannel discovers Channel servers via Consul➞ Scatters user writes➞ Gathers channel reads

● Failures do not need reconnects