
Sharding @ Instagram
SFPUG, April 2012

Mike Krieger, Instagram

me

- Co-founder, Instagram
- Previously: UX & front-end @ Meebo
- Stanford HCI BS/MS
- mikeyk on everything

pug

communicating and sharing in the real world

30+ million users in less than 2 years

at its heart, Postgres-driven

a glimpse at how a startup with a small eng team scaled with PG

a brief tangent

the beginning

2 product guys

no real back-end experience

(you should have seen my first time finding my way around psql)

analytics & Python @ Meebo

CouchDB

CrimeDesk SF

early mix: PG, Redis, Memcached

but we were hosted on a single machine somewhere in LA

less powerful than my MacBook Pro

okay, we launched. now what?

25k signups in the first day

everything is on fire

best & worst day of our lives so far

load was through the roof

friday rolls around

not slowing down

let's move to EC2

PG upgrade to 9.0

scaling = replacing all components of a car while driving it at 100mph

this is the story of how our usage of PG has evolved

Phase 1: All ORM, all the time

why PG? at first, PostGIS

manage.py syncdb

ORM made it too easy to not really think through primary keys

pretty good for getting off the ground

Media.objects.get(pk=4)

first version of our feed (pre-launch):

friends = Relationships.objects.filter(source_user=user)

recent_photos = Media.objects.filter(user_id__in=friends).order_by('-pk')[0:20]

main feed at launch:

Redis: user 33 posts
friends = SMEMBERS followers:33
for user in friends:
    LPUSH feed:<user_id> <media_id>

for reading:
LRANGE feed:4 0 20
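a rough sketch of that fan-out-on-write in redis-py; the helper names and the list-trim cap are assumptions, not the production code:

import redis

r = redis.StrictRedis(host="localhost", port=6379)

def push_media_to_feeds(author_id, media_id, max_feed_len=500):
    # LPUSH the new media id onto every follower's feed list (fan-out on write)
    follower_ids = r.smembers("followers:%d" % author_id)
    for follower_id in follower_ids:
        key = "feed:%s" % follower_id.decode()
        r.lpush(key, media_id)
        r.ltrim(key, 0, max_feed_len - 1)  # keep each feed list bounded (assumption)

def read_feed(user_id, count=20):
    # LRANGE the newest <count> media ids for a user
    return r.lrange("feed:%d" % user_id, 0, count - 1)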

canonical data: PG
feeds/lists/sets: Redis
object cache: memcache

post-launch

moved the db to its own machine

at the time, largest table: photo metadata

ran master-slave from the beginning, with streaming replication

backups: stop the replica, xfs_freeze the drives, and take an EBS snapshot

AWS tradeoff

3 early problems we hit with PG

1. oh, that setting was the problem

work_mem

shared_buffers

cost_delay

2. Django-specific: <IDLE in transaction>

3. connection pooling

(we use PGBouncer)

somewhere in this crazy couple of months, Christophe to the rescue

photos kept growing and growing

and only 68GB of RAM on the biggest machine in EC2

so, what now?

Phase 2: Vertical Partitioning

Django DB routers make it pretty easy:

def db_for_read(self, model, **hints):
    if model._meta.app_label == 'photos':
        return 'photodb'
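a fuller sketch against Django's DB-router API; the 'photodb' alias and 'photos' app label come from the slide, the write/relation methods are an assumed extension, not from the talk:

# settings.DATABASES would define the 'default' and 'photodb' aliases
class PhotoRouter(object):

    def db_for_read(self, model, **hints):
        if model._meta.app_label == 'photos':
            return 'photodb'
        return None  # fall through to 'default'

    def db_for_write(self, model, **hints):
        if model._meta.app_label == 'photos':
            return 'photodb'
        return None

    def allow_relation(self, obj1, obj2, **hints):
        # once split, avoid cross-database foreign-key relations
        return obj1._state.db == obj2._state.db

# activated with DATABASE_ROUTERS = ['path.to.PhotoRouter'] in settings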

once you untangle all your foreign key relationships

(all of those user/user_id interchangeable calls bite you now)

plenty of time spent in pgFouine

read slaves (using streaming replicas) where we need to reduce contention

a few months later

photos db > 60GB

precipitated by being on cloud hardware, but likely to have hit the limit eventually either way

what now?

horizontal partitioning

Phase 3: sharding

"surely we'll have hired someone experienced before we actually need to shard"

never true about scaling

1. choosing a method
2. adapting the application
3. expanding capacity

evaluated solutions

at the time, none were up to the task of being our primary DB

NoSQL alternatives

Skype's sharding proxy

range/date-based partitioning

did it in Postgres itself

requirements

1. low operational & code complexity

2. easy expansion of capacity

3. low performance impact on the application

schema-based logical sharding

many, many, many (thousands) of logical shards

that map to fewer physical ones

8 logical shards on 2 machines

user_id % 8 = logical shard

logical shards -> physical shard map:

0: A, 1: A, 2: A, 3: A
4: B, 5: B, 6: B, 7: B

8 logical shards on 2 -> 4 machines

user_id % 8 = logical shard

logical shards -> physical shard map:

0: A, 1: A, 2: C, 3: C
4: B, 5: B, 6: D, 7: D
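in code, the lookup is just a modulo plus a dict; machine names here are placeholders:

NUM_LOGICAL_SHARDS = 8

# before adding capacity: 8 logical shards on 2 machines
SHARD_MAP = {0: 'A', 1: 'A', 2: 'A', 3: 'A',
             4: 'B', 5: 'B', 6: 'B', 7: 'B'}

# after adding capacity: only the map changes, a user's logical shard does not
SHARD_MAP_AFTER = {0: 'A', 1: 'A', 2: 'C', 3: 'C',
                   4: 'B', 5: 'B', 6: 'D', 7: 'D'}

def machine_for_user(user_id, shard_map=SHARD_MAP):
    logical_shard = user_id % NUM_LOGICAL_SHARDS
    return logical_shard, shard_map[logical_shard]

print(machine_for_user(10))                   # (2, 'A')
print(machine_for_user(10, SHARD_MAP_AFTER))  # (2, 'C'): same shard, new machine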

¡schemas!

all that 'public' stuff I'd been glossing over for 2 years

- database
  - schema
    - table
      - columns

spun up a set of machines

using fabric, created thousands of schemas

machineA:
  shard0.photos_by_user
  shard1.photos_by_user
  shard2.photos_by_user
  shard3.photos_by_user

machineB:
  shard4.photos_by_user
  shard5.photos_by_user
  shard6.photos_by_user
  shard7.photos_by_user

(fabric, or a similar parallel task executor, is essential)
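a sketch of the DDL such a fabric run might push to each machine; the photos_by_user columns are invented for illustration:

def shard_ddl(shard_ids):
    # yield the CREATE statements for each logical shard's schema
    for shard_id in shard_ids:
        schema = "shard%d" % shard_id
        yield "CREATE SCHEMA %s;" % schema
        yield ("CREATE TABLE %s.photos_by_user ("
               "user_id bigint NOT NULL, "
               "media_id bigint NOT NULL, "
               "created_at timestamp NOT NULL DEFAULT now());" % schema)

# machineA would get shards 0-3, machineB shards 4-7
for statement in shard_ddl(range(0, 4)):
    print(statement)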

application-side logic:

SHARD_TO_DB = {}

SHARD_TO_DB[0] = 0
SHARD_TO_DB[1] = 0
SHARD_TO_DB[2] = 0
SHARD_TO_DB[3] = 0
SHARD_TO_DB[4] = 1
SHARD_TO_DB[5] = 1
SHARD_TO_DB[6] = 1
SHARD_TO_DB[7] = 1

instead of the Django ORM, wrote a really simple db abstraction layer:

select / update / insert / delete

select(fields, table_name, shard_key, where_statement, where_parameters)

shard_key % num_logical_shards = shard_id

in most cases, that's user_id for us
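a sketch of how such a select() could route a query, assuming one psycopg2 connection per physical DB and the SHARD_TO_DB map above; the real layer surely differs:

NUM_LOGICAL_SHARDS = 8
SHARD_TO_DB = {0: 0, 1: 0, 2: 0, 3: 0, 4: 1, 5: 1, 6: 1, 7: 1}

def select(connections, fields, table_name, shard_key, where_statement, where_parameters):
    # connections: dict of physical-DB index -> psycopg2 connection
    shard_id = shard_key % NUM_LOGICAL_SHARDS
    conn = connections[SHARD_TO_DB[shard_id]]
    sql = "SELECT %s FROM shard%d.%s WHERE %s" % (
        ", ".join(fields), shard_id, table_name, where_statement)
    cur = conn.cursor()
    cur.execute(sql, where_parameters)  # values still go through bound parameters
    return cur.fetchall()

# e.g. select(conns, ['media_id'], 'photos_by_user', user_id,
#             'user_id = %s ORDER BY media_id DESC LIMIT 20', [user_id])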

custom Django test runner to create/tear down sharded DBs

most queries involve visiting a handful of shards over one or two machines

if mapping across shards on a single DB, UNION ALL to aggregate

clients to the library pass in ((shard_key, id), (shard_key, id)), etc.

the library maps sub-selects to each shard and each machine
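a sketch of that sub-select mapping: group keys by machine, then glue schema-qualified sub-selects together with UNION ALL (helper names assumed; real code would bind parameters rather than interpolate):

from collections import defaultdict

NUM_LOGICAL_SHARDS = 8
SHARD_TO_DB = {0: 0, 1: 0, 2: 0, 3: 0, 4: 1, 5: 1, 6: 1, 7: 1}

def union_all_queries(pairs, table_name, fields):
    # pairs: ((shard_key, id), (shard_key, id), ...) as passed in by clients
    by_machine = defaultdict(list)
    for shard_key, obj_id in pairs:
        shard_id = shard_key % NUM_LOGICAL_SHARDS
        by_machine[SHARD_TO_DB[shard_id]].append((shard_id, obj_id))

    queries = {}
    for machine, items in by_machine.items():
        subselects = ["(SELECT %s FROM shard%d.%s WHERE id = %d)"
                      % (", ".join(fields), shard_id, table_name, obj_id)
                      for shard_id, obj_id in items]
        # one query per machine; sub-selects on the same DB get joined with UNION ALL
        queries[machine] = " UNION ALL ".join(subselects)
    return queries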

parallel execution (per-machine, at least)

->  Append  (cost=0.00..973.72 rows=100 width=12) (actual time=0.290..160.035 rows=30 loops=1)
      ->  Limit  (cost=0.00..806.24 rows=30 width=12) (actual time=0.288..159.913 rows=14 loops=1)
            ->  Index Scan Backward using index on table  (cost=0.00..18651.04 rows=694 width=12) (actual time=0.286..159.885 rows=14 loops=1)
      ->  Limit  (cost=0.00..71.15 rows=30 width=12) (actual time=0.015..0.018 rows=1 loops=1)
            ->  Index Scan using index on table  (cost=0.00..101.99 rows=43 width=12) (actual time=0.013..0.014 rows=1 loops=1)
(etc)

eventually it would be nice to parallelize across machines

next challenge: unique IDs

requirements

1. should be time-sortable without requiring a lookup

2. should be 64-bit

3. low operational complexity

surveyed the options:

ticket servers

UUID

twitter snowflake

application-level IDs a la Mongo

hey, the db is already pretty good about incrementing sequences

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

CREATE OR REPLACE FUNCTION insta5.next_id(OUT result bigint) AS $$
DECLARE
    our_epoch bigint := 1314220021721;
    seq_id bigint;
    now_millis bigint;
    shard_id int := 5;
BEGIN
    SELECT nextval('insta5.table_id_seq') % 1024 INTO seq_id;
    SELECT FLOOR(EXTRACT(EPOCH FROM clock_timestamp()) * 1000) INTO now_millis;
    result := (now_millis - our_epoch) << 23;
    result := result | (shard_id << 10);
    result := result | (seq_id);
END;
$$ LANGUAGE PLPGSQL;

pulling the shard ID back out of an ID:

shard_id = id ^ ((id >> 23) << 23)
timestamp = EPOCH + id >> 23
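the same unpacking in Python, following the 41/13/10 bit layout above:

OUR_EPOCH_MS = 1314220021721  # the custom epoch from the PL/pgSQL function

def decode_id(the_id):
    millis_since_epoch = the_id >> 23              # top 41 bits
    low_23_bits = the_id ^ ((the_id >> 23) << 23)  # as on the slide
    shard_id = low_23_bits >> 10                   # 13 shard-ID bits
    seq_id = low_23_bits & 1023                    # 10 sequence bits
    return OUR_EPOCH_MS + millis_since_epoch, shard_id, seq_id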

pros: guaranteed unique in 64 bits, not much CPU overhead

cons: large IDs from the get-go

hundreds of millions of IDs generated with this scheme, no issues

well, what about "re-sharding"?

first recourse: pg_reorg

rewrites tables in index order

only requires brief locks for atomic table renames

20+GB savings on some of our DBs

especially useful on EC2

but sometimes you just have to reshard

streaming replication to the rescue

(btw, repmgr is awesome)

repmgr standby clone <master>

machineA:
  shard0.photos_by_user
  shard1.photos_by_user
  shard2.photos_by_user
  shard3.photos_by_user

machineA':
  shard0.photos_by_user
  shard1.photos_by_user
  shard2.photos_by_user
  shard3.photos_by_user

machineA:
  shard0.photos_by_user
  shard1.photos_by_user
  shard2.photos_by_user
  shard3.photos_by_user

machineC:
  shard0.photos_by_user
  shard1.photos_by_user
  shard2.photos_by_user
  shard3.photos_by_user

PGBouncer abstracts moving DBs from the app logic

can do this as long as you have more logical shards than physical ones

the beauty of schemas is that they are physically different files

(no I/O hit when deleting, no 'swiss cheese')

downside: requires ~30 seconds of maintenance to roll out a new schema mapping

(could be solved by having a concept of "read-only" mode for some DBs)

not great for range-scans that would span across shards

latest project: the follow graph

v1: simple DB table (source_id, target_id, status)

who do I follow?
who follows me?
do I follow X?
does X follow me?
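each of those questions is a single indexed lookup on that table; a sketch (table name and cursor plumbing assumed, status filtering omitted):

def who_do_i_follow(cur, user_id):
    cur.execute("SELECT target_id FROM follows WHERE source_id = %s", [user_id])
    return [row[0] for row in cur.fetchall()]

def who_follows_me(cur, user_id):
    cur.execute("SELECT source_id FROM follows WHERE target_id = %s", [user_id])
    return [row[0] for row in cur.fetchall()]

def do_i_follow(cur, user_id, other_id):
    cur.execute("SELECT 1 FROM follows WHERE source_id = %s AND target_id = %s",
                [user_id, other_id])
    return cur.fetchone() is not None

def does_x_follow_me(cur, user_id, other_id):
    return do_i_follow(cur, other_id, user_id)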

the DB was busy, so we started storing a parallel version in Redis

follow_all(300 item list)

inconsistency

extra logic

so much extra logic

exposing your support team to the idea of cache invalidation

redesign took a page from twitter's book

PG can handle tens of thousands of requests, with very light memcached caching
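i.e. read-through caching in front of PG; a sketch where the memcached client, key names and TTL are all placeholders:

FOLLOWER_COUNT_TTL = 60  # seconds; assumption

def follower_count(cache, cur, user_id):
    # cache: any memcached client with get/set (e.g. pylibmc)
    key = "follower_count:%d" % user_id
    cached = cache.get(key)
    if cached is not None:
        return cached
    cur.execute("SELECT count(*) FROM follows WHERE target_id = %s", [user_id])
    count = cur.fetchone()[0]
    cache.set(key, count, FOLLOWER_COUNT_TTL)
    return count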

next steps

isolating services to minimize open conns

investigate physical hardware, etc., to reduce the need to re-shard

Wrap up

you don't need to give up PG's durability & features to shard

continue to let the DB do what the DB is great at

"don't shard until you have to"

(but don't over-estimate how hard it will be, either)

scaled within constraints of the cloud

PG success story

(we're really excited about 9.2)

thanks! any q's?

me

- Co-founder Instagram

- Previously UX amp Front-end Meebo

- Stanford HCI BSMS

- mikeyk on everything

Monday April 23 12

pug

Monday April 23 12

communicating and sharing in the real world

Monday April 23 12

Monday April 23 12

Monday April 23 12

Monday April 23 12

30+ million users in less than 2 years

Monday April 23 12

at its heart Postgres-driven

Monday April 23 12

a glimpse at how a startup with a small eng team scaled with PG

Monday April 23 12

a brief tangent

Monday April 23 12

the beginning

Monday April 23 12

Text

Monday April 23 12

2 product guys

Monday April 23 12

no real back-end experience

Monday April 23 12

(you should have seen my first time finding my

way around psql)

Monday April 23 12

analytics amp python meebo

Monday April 23 12

CouchDB

Monday April 23 12

CrimeDesk SF

Monday April 23 12

Monday April 23 12

early mix PG Redis Memcached

Monday April 23 12

but were hosted on a single machine

somewhere in LA

Monday April 23 12

Monday April 23 12

less powerful than my MacBook Pro

Monday April 23 12

okay we launchednow what

Monday April 23 12

25k signups in the first day

Monday April 23 12

everything is on fire

Monday April 23 12

best amp worst day of our lives so far

Monday April 23 12

load was through the roof

Monday April 23 12

friday rolls around

Monday April 23 12

not slowing down

Monday April 23 12

letrsquos move to EC2

Monday April 23 12

Monday April 23 12

PG upgrade to 90

Monday April 23 12

Monday April 23 12

scaling = replacing all components of a car

while driving it at 100mph

Monday April 23 12

this is the story of how our usage of PG has

evolved

Monday April 23 12

Phase 1 All ORM all the time

Monday April 23 12

why pg at first postgis

Monday April 23 12

managepy syncdb

Monday April 23 12

ORM made it too easy to not really think through

primary keys

Monday April 23 12

pretty good for getting off the ground

Monday April 23 12

Mediaobjectsget(pk = 4)

Monday April 23 12

first version of our feed (pre-launch)

Monday April 23 12

friends = Relationshipsobjectsfilter(source_user = user)

recent_photos = Mediaobjectsfilter(user_id__in = friends)order_by(lsquo-pkrsquo)[020]

Monday April 23 12

main feed at launch

Monday April 23 12

Redis user 33 postsfriends = SMEMBERS followers33for user in friends LPUSH feedltuser_idgt ltmedia_idgt

for readingLRANGE feed4 0 20

Monday April 23 12

canonical data PGfeedslistssets Redis

object cache memcache

Monday April 23 12

post-launch

Monday April 23 12

moved db to its own machine

Monday April 23 12

at time largest table photo metadata

Monday April 23 12

ran master-slave from the beginning with

streaming replication

Monday April 23 12

backups stop the replica xfs_freeze drives and take EBS snapshot

Monday April 23 12

AWS tradeoff

Monday April 23 12

3 early problems we hit with PG

Monday April 23 12

1 oh that setting was the problem

Monday April 23 12

work_mem

Monday April 23 12

shared_buffers

Monday April 23 12

cost_delay

Monday April 23 12

2 Django-specific ltidle in transactiongt

Monday April 23 12

3 connection pooling

Monday April 23 12

(we use PGBouncer)

Monday April 23 12

somewhere in this crazy couple of months

Christophe to the rescue

Monday April 23 12

photos kept growing and growing

Monday April 23 12

and only 68GB of RAM on biggest machine in EC2

Monday April 23 12

so what now

Monday April 23 12

Phase 2 Vertical Partitioning

Monday April 23 12

django db routers make it pretty easy

Monday April 23 12

def db_for_read(self model) if app_label == photos return photodb

Monday April 23 12

once you untangle all your foreign key

relationships

Monday April 23 12

(all of those useruser_id interchangeable calls bite

you now)

Monday April 23 12

plenty of time spent in PGFouine

Monday April 23 12

read slaves (using streaming replicas) where

we need to reduce contention

Monday April 23 12

a few months later

Monday April 23 12

photosdb gt 60GB

Monday April 23 12

precipitated by being on cloud hardware but likely to have hit limit eventually

either way

Monday April 23 12

what now

Monday April 23 12

horizontal partitioning

Monday April 23 12

Phase 3 sharding

Monday April 23 12

ldquosurely wersquoll have hired someone experienced before we actually need

to shardrdquo

Monday April 23 12

never true about scaling

Monday April 23 12

1 choosing a method2 adapting the application

3 expanding capacity

Monday April 23 12

evaluated solutions

Monday April 23 12

at the time none were up to task of being our

primary DB

Monday April 23 12

NoSQL alternatives

Monday April 23 12

Skypersquos sharding proxy

Monday April 23 12

Rangedate-based partitioning

Monday April 23 12

did in Postgres itself

Monday April 23 12

requirements

Monday April 23 12

1 low operational amp code complexity

Monday April 23 12

2 easy expanding of capacity

Monday April 23 12

3 low performance impact on application

Monday April 23 12

schema-based logical sharding

Monday April 23 12

many many many (thousands) of logical

shards

Monday April 23 12

that map to fewer physical ones

Monday April 23 12

8 logical shards on 2 machines

user_id 8 = logical shard

logical shards -gt physical shard map

0 A 1 A 2 A 3 A 4 B 5 B 6 B 7 B

Monday April 23 12

8 logical shards on 2 4 machines

user_id 8 = logical shard

logical shards -gt physical shard map

0 A 1 A 2 C 3 C 4 B 5 B 6 D 7 D

Monday April 23 12

iexclschemas

Monday April 23 12

all that lsquopublicrsquo stuff Irsquod been glossing over for 2

years

Monday April 23 12

- database - schema - table - columns

Monday April 23 12

spun up set of machines

Monday April 23 12

using fabric created thousands of schemas

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineB shard4 photos_by_user shard5 photos_by_user shard6 photos_by_user shard7 photos_by_user

Monday April 23 12

(fabric or similar parallel task executor is

essential)

Monday April 23 12

application-side logic

Monday April 23 12

SHARD_TO_DB =

SHARD_TO_DB[0] = 0SHARD_TO_DB[1] = 0SHARD_TO_DB[2] = 0SHARD_TO_DB[3] = 0SHARD_TO_DB[4] = 1SHARD_TO_DB[5] = 1SHARD_TO_DB[6] = 1SHARD_TO_DB[7] = 1

Monday April 23 12

instead of Django ORM wrote really simple db

abstraction layer

Monday April 23 12

selectupdateinsertdelete

Monday April 23 12

select(fields table_name shard_key where_statement where_parameters)

Monday April 23 12

select(fields table_name shard_key where_statement where_parameters)

shard_key num_logical_shards = shard_id

Monday April 23 12

in most cases user_id for us

Monday April 23 12

custom Django test runner to createtear-down sharded DBs

Monday April 23 12

most queries involve visiting handful of shards

over one or two machines

Monday April 23 12

if mapping across shards on single DB UNION

ALL to aggregate

Monday April 23 12

clients to library pass in((shard_key id)

(shard_keyid)) etc

Monday April 23 12

library maps sub-selects to each shard and each

machine

Monday April 23 12

parallel execution (per-machine at least)

Monday April 23 12

-gt Append (cost=00097372 rows=100 width=12) (actual time=0290160035 rows=30 loops=1) -gt Limit (cost=00080624 rows=30 width=12) (actual time=0288159913 rows=14 loops=1) -gt Index Scan Backward using index on table (cost=0001865104 rows=694 width=12) (actual time=0286159885 rows=14 loops=1) -gt Limit (cost=0007115 rows=30 width=12) (actual time=00150018 rows=1 loops=1) -gt Index Scan using index on table (cost=00010199 rows=43 width=12) (actual time=00130014 rows=1 loops=1) (etc)

Monday April 23 12

eventually would be nice to parallelize across

machines

Monday April 23 12

next challenge unique IDs

Monday April 23 12

requirements

Monday April 23 12

1 should be time sortable without requiring

a lookup

Monday April 23 12

2 should be 64-bit

Monday April 23 12

3 low operational complexity

Monday April 23 12

surveyed the options

Monday April 23 12

ticket servers

Monday April 23 12

UUID

Monday April 23 12

twitter snowflake

Monday April 23 12

application-level IDs ala Mongo

Monday April 23 12

hey the db is already pretty good about

incrementing sequences

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

CREATE OR REPLACE FUNCTION insta5next_id(OUT result bigint) AS $$DECLARE our_epoch bigint = 1314220021721 seq_id bigint now_millis bigint shard_id int = 5BEGIN SELECT nextval(insta5table_id_seq) 1024 INTO seq_id

SELECT FLOOR(EXTRACT(EPOCH FROM clock_timestamp()) 1000) INTO now_millis result = (now_millis - our_epoch) ltlt 23 result = result | (shard_id ltlt 10) result = result | (seq_id)END$$ LANGUAGE PLPGSQL

Monday April 23 12

pulling shard ID from ID

shard_id = id ^ ((id gtgt 23) ltlt 23)timestamp = EPOCH + id gtgt 23

Monday April 23 12

pros guaranteed unique in 64-bits not much of a

CPU overhead

Monday April 23 12

cons large IDs from the get-go

Monday April 23 12

hundreds of millions of IDs generated with this

scheme no issues

Monday April 23 12

well what about ldquore-shardingrdquo

Monday April 23 12

first recourse pg_reorg

Monday April 23 12

rewrites tables in index order

Monday April 23 12

only requires brief locks for atomic table renames

Monday April 23 12

20+GB savings on some of our dbs

Monday April 23 12

Monday April 23 12

especially useful on EC2

Monday April 23 12

but sometimes you just have to reshard

Monday April 23 12

streaming replication to the rescue

Monday April 23 12

(btw repmgr is awesome)

Monday April 23 12

repmgr standby clone ltmastergt

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineArsquo shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineC shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

Monday April 23 12

PGBouncer abstracts moving DBs from the

app logic

Monday April 23 12

can do this as long as you have more logical shards than physical

ones

Monday April 23 12

beauty of schemas is that they are physically

different files

Monday April 23 12

(no IO hit when deleting no lsquoswiss cheesersquo)

Monday April 23 12

downside requires ~30 seconds of maintenance to roll out new schema

mapping

Monday April 23 12

(could be solved by having concept of ldquoread-

onlyrdquo mode for some DBs)

Monday April 23 12

not great for range-scans that would span across

shards

Monday April 23 12

latest project follow graph

Monday April 23 12

v1 simple DB table(source_id target_id

status)

Monday April 23 12

who do I followwho follows me

do I follow Xdoes X follow me

Monday April 23 12

DB was busy so we started storing parallel

version in Redis

Monday April 23 12

follow_all(300 item list)

Monday April 23 12

inconsistency

Monday April 23 12

extra logic

Monday April 23 12

so much extra logic

Monday April 23 12

exposing your support team to the idea of cache invalidation

Monday April 23 12

Monday April 23 12

redesign took a page from twitterrsquos book

Monday April 23 12

PG can handle tens of thousands of requests very light memcached

caching

Monday April 23 12

next steps

Monday April 23 12

isolating services to minimize open conns

Monday April 23 12

investigate physical hardware etc to reduce

need to re-shard

Monday April 23 12

Wrap up

Monday April 23 12

you donrsquot need to give up PGrsquos durability amp features to shard

Monday April 23 12

continue to let the DB do what the DB is great at

Monday April 23 12

ldquodonrsquot shard until you have tordquo

Monday April 23 12

(but donrsquot over-estimate how hard it will be either)

Monday April 23 12

scaled within constraints of the cloud

Monday April 23 12

PG success story

Monday April 23 12

(wersquore really excited about 92)

Monday April 23 12

thanks any qs

Monday April 23 12

pug

Monday April 23 12

communicating and sharing in the real world

Monday April 23 12

Monday April 23 12

Monday April 23 12

Monday April 23 12

30+ million users in less than 2 years

Monday April 23 12

at its heart Postgres-driven

Monday April 23 12

a glimpse at how a startup with a small eng team scaled with PG

Monday April 23 12

a brief tangent

Monday April 23 12

the beginning

Monday April 23 12

Text

Monday April 23 12

2 product guys

Monday April 23 12

no real back-end experience

Monday April 23 12

(you should have seen my first time finding my

way around psql)

Monday April 23 12

analytics amp python meebo

Monday April 23 12

CouchDB

Monday April 23 12

CrimeDesk SF

Monday April 23 12

Monday April 23 12

early mix PG Redis Memcached

Monday April 23 12

but were hosted on a single machine

somewhere in LA

Monday April 23 12

Monday April 23 12

less powerful than my MacBook Pro

Monday April 23 12

okay we launchednow what

Monday April 23 12

25k signups in the first day

Monday April 23 12

everything is on fire

Monday April 23 12

best amp worst day of our lives so far

Monday April 23 12

load was through the roof

Monday April 23 12

friday rolls around

Monday April 23 12

not slowing down

Monday April 23 12

letrsquos move to EC2

Monday April 23 12

Monday April 23 12

PG upgrade to 90

Monday April 23 12

Monday April 23 12

scaling = replacing all components of a car

while driving it at 100mph

Monday April 23 12

this is the story of how our usage of PG has

evolved

Monday April 23 12

Phase 1 All ORM all the time

Monday April 23 12

why pg at first postgis

Monday April 23 12

managepy syncdb

Monday April 23 12

ORM made it too easy to not really think through

primary keys

Monday April 23 12

pretty good for getting off the ground

Monday April 23 12

Mediaobjectsget(pk = 4)

Monday April 23 12

first version of our feed (pre-launch)

Monday April 23 12

friends = Relationshipsobjectsfilter(source_user = user)

recent_photos = Mediaobjectsfilter(user_id__in = friends)order_by(lsquo-pkrsquo)[020]

Monday April 23 12

main feed at launch

Monday April 23 12

Redis user 33 postsfriends = SMEMBERS followers33for user in friends LPUSH feedltuser_idgt ltmedia_idgt

for readingLRANGE feed4 0 20

Monday April 23 12

canonical data PGfeedslistssets Redis

object cache memcache

Monday April 23 12

post-launch

Monday April 23 12

moved db to its own machine

Monday April 23 12

at time largest table photo metadata

Monday April 23 12

ran master-slave from the beginning with

streaming replication

Monday April 23 12

backups stop the replica xfs_freeze drives and take EBS snapshot

Monday April 23 12

AWS tradeoff

Monday April 23 12

3 early problems we hit with PG

Monday April 23 12

1 oh that setting was the problem

Monday April 23 12

work_mem

Monday April 23 12

shared_buffers

Monday April 23 12

cost_delay

Monday April 23 12

2 Django-specific ltidle in transactiongt

Monday April 23 12

3 connection pooling

Monday April 23 12

(we use PGBouncer)

Monday April 23 12

somewhere in this crazy couple of months

Christophe to the rescue

Monday April 23 12

photos kept growing and growing

Monday April 23 12

and only 68GB of RAM on biggest machine in EC2

Monday April 23 12

so what now

Monday April 23 12

Phase 2 Vertical Partitioning

Monday April 23 12

django db routers make it pretty easy

Monday April 23 12

def db_for_read(self model) if app_label == photos return photodb

Monday April 23 12

once you untangle all your foreign key

relationships

Monday April 23 12

(all of those useruser_id interchangeable calls bite

you now)

Monday April 23 12

plenty of time spent in PGFouine

Monday April 23 12

read slaves (using streaming replicas) where

we need to reduce contention

Monday April 23 12

a few months later

Monday April 23 12

photosdb gt 60GB

Monday April 23 12

precipitated by being on cloud hardware but likely to have hit limit eventually

either way

Monday April 23 12

what now

Monday April 23 12

horizontal partitioning

Monday April 23 12

Phase 3 sharding

Monday April 23 12

ldquosurely wersquoll have hired someone experienced before we actually need

to shardrdquo

Monday April 23 12

never true about scaling

Monday April 23 12

1 choosing a method2 adapting the application

3 expanding capacity

Monday April 23 12

evaluated solutions

Monday April 23 12

at the time none were up to task of being our

primary DB

Monday April 23 12

NoSQL alternatives

Monday April 23 12

Skypersquos sharding proxy

Monday April 23 12

Rangedate-based partitioning

Monday April 23 12

did in Postgres itself

Monday April 23 12

requirements

Monday April 23 12

1 low operational amp code complexity

Monday April 23 12

2 easy expanding of capacity

Monday April 23 12

3 low performance impact on application

Monday April 23 12

schema-based logical sharding

Monday April 23 12

many many many (thousands) of logical

shards

Monday April 23 12

that map to fewer physical ones

Monday April 23 12

8 logical shards on 2 machines

user_id 8 = logical shard

logical shards -gt physical shard map

0 A 1 A 2 A 3 A 4 B 5 B 6 B 7 B

Monday April 23 12

8 logical shards on 2 4 machines

user_id 8 = logical shard

logical shards -gt physical shard map

0 A 1 A 2 C 3 C 4 B 5 B 6 D 7 D

Monday April 23 12

iexclschemas

Monday April 23 12

all that lsquopublicrsquo stuff Irsquod been glossing over for 2

years

Monday April 23 12

- database - schema - table - columns

Monday April 23 12

spun up set of machines

Monday April 23 12

using fabric created thousands of schemas

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineB shard4 photos_by_user shard5 photos_by_user shard6 photos_by_user shard7 photos_by_user

Monday April 23 12

(fabric or similar parallel task executor is

essential)

Monday April 23 12

application-side logic

Monday April 23 12

SHARD_TO_DB =

SHARD_TO_DB[0] = 0SHARD_TO_DB[1] = 0SHARD_TO_DB[2] = 0SHARD_TO_DB[3] = 0SHARD_TO_DB[4] = 1SHARD_TO_DB[5] = 1SHARD_TO_DB[6] = 1SHARD_TO_DB[7] = 1

Monday April 23 12

instead of Django ORM wrote really simple db

abstraction layer

Monday April 23 12

selectupdateinsertdelete

Monday April 23 12

select(fields table_name shard_key where_statement where_parameters)

Monday April 23 12

select(fields table_name shard_key where_statement where_parameters)

shard_key num_logical_shards = shard_id

Monday April 23 12

in most cases user_id for us

Monday April 23 12

custom Django test runner to createtear-down sharded DBs

Monday April 23 12

most queries involve visiting handful of shards

over one or two machines

Monday April 23 12

if mapping across shards on single DB UNION

ALL to aggregate

Monday April 23 12

clients to library pass in((shard_key id)

(shard_keyid)) etc

Monday April 23 12

library maps sub-selects to each shard and each

machine

Monday April 23 12

parallel execution (per-machine at least)

Monday April 23 12

-gt Append (cost=00097372 rows=100 width=12) (actual time=0290160035 rows=30 loops=1) -gt Limit (cost=00080624 rows=30 width=12) (actual time=0288159913 rows=14 loops=1) -gt Index Scan Backward using index on table (cost=0001865104 rows=694 width=12) (actual time=0286159885 rows=14 loops=1) -gt Limit (cost=0007115 rows=30 width=12) (actual time=00150018 rows=1 loops=1) -gt Index Scan using index on table (cost=00010199 rows=43 width=12) (actual time=00130014 rows=1 loops=1) (etc)

Monday April 23 12

eventually would be nice to parallelize across

machines

Monday April 23 12

next challenge unique IDs

Monday April 23 12

requirements

Monday April 23 12

1 should be time sortable without requiring

a lookup

Monday April 23 12

2 should be 64-bit

Monday April 23 12

3 low operational complexity

Monday April 23 12

surveyed the options

Monday April 23 12

ticket servers

Monday April 23 12

UUID

Monday April 23 12

twitter snowflake

Monday April 23 12

application-level IDs ala Mongo

Monday April 23 12

hey the db is already pretty good about

incrementing sequences

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

CREATE OR REPLACE FUNCTION insta5next_id(OUT result bigint) AS $$DECLARE our_epoch bigint = 1314220021721 seq_id bigint now_millis bigint shard_id int = 5BEGIN SELECT nextval(insta5table_id_seq) 1024 INTO seq_id

SELECT FLOOR(EXTRACT(EPOCH FROM clock_timestamp()) 1000) INTO now_millis result = (now_millis - our_epoch) ltlt 23 result = result | (shard_id ltlt 10) result = result | (seq_id)END$$ LANGUAGE PLPGSQL

Monday April 23 12

pulling shard ID from ID

shard_id = id ^ ((id gtgt 23) ltlt 23)timestamp = EPOCH + id gtgt 23

Monday April 23 12

pros guaranteed unique in 64-bits not much of a

CPU overhead

Monday April 23 12

cons large IDs from the get-go

Monday April 23 12

hundreds of millions of IDs generated with this

scheme no issues

Monday April 23 12

well what about ldquore-shardingrdquo

Monday April 23 12

first recourse pg_reorg

Monday April 23 12

rewrites tables in index order

Monday April 23 12

only requires brief locks for atomic table renames

Monday April 23 12

20+GB savings on some of our dbs

Monday April 23 12

Monday April 23 12

especially useful on EC2

Monday April 23 12

but sometimes you just have to reshard

Monday April 23 12

streaming replication to the rescue

Monday April 23 12

(btw repmgr is awesome)

Monday April 23 12

repmgr standby clone ltmastergt

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineArsquo shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineC shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

Monday April 23 12

PGBouncer abstracts moving DBs from the

app logic

Monday April 23 12

can do this as long as you have more logical shards than physical

ones

Monday April 23 12

beauty of schemas is that they are physically

different files

Monday April 23 12

(no IO hit when deleting no lsquoswiss cheesersquo)

Monday April 23 12

downside requires ~30 seconds of maintenance to roll out new schema

mapping

Monday April 23 12

(could be solved by having concept of ldquoread-

onlyrdquo mode for some DBs)

Monday April 23 12

not great for range-scans that would span across

shards

Monday April 23 12

latest project follow graph

Monday April 23 12

v1 simple DB table(source_id target_id

status)

Monday April 23 12

who do I followwho follows me

do I follow Xdoes X follow me

Monday April 23 12

DB was busy so we started storing parallel

version in Redis

Monday April 23 12

follow_all(300 item list)

Monday April 23 12

inconsistency

Monday April 23 12

extra logic

Monday April 23 12

so much extra logic

Monday April 23 12

exposing your support team to the idea of cache invalidation

Monday April 23 12

Monday April 23 12

redesign took a page from twitterrsquos book

Monday April 23 12

PG can handle tens of thousands of requests very light memcached

caching

Monday April 23 12

next steps

Monday April 23 12

isolating services to minimize open conns

Monday April 23 12

investigate physical hardware etc to reduce

need to re-shard

Monday April 23 12

Wrap up

Monday April 23 12

you donrsquot need to give up PGrsquos durability amp features to shard

Monday April 23 12

continue to let the DB do what the DB is great at

Monday April 23 12

ldquodonrsquot shard until you have tordquo

Monday April 23 12

(but donrsquot over-estimate how hard it will be either)

Monday April 23 12

scaled within constraints of the cloud

Monday April 23 12

PG success story

Monday April 23 12

(wersquore really excited about 92)

Monday April 23 12

thanks any qs

Monday April 23 12

communicating and sharing in the real world

Monday April 23 12

Monday April 23 12

Monday April 23 12

Monday April 23 12

30+ million users in less than 2 years

Monday April 23 12

at its heart Postgres-driven

Monday April 23 12

a glimpse at how a startup with a small eng team scaled with PG

Monday April 23 12

a brief tangent

Monday April 23 12

the beginning

Monday April 23 12

Text

Monday April 23 12

2 product guys

Monday April 23 12

no real back-end experience

Monday April 23 12

(you should have seen my first time finding my

way around psql)

Monday April 23 12

analytics amp python meebo

Monday April 23 12

CouchDB

Monday April 23 12

CrimeDesk SF

Monday April 23 12

Monday April 23 12

early mix PG Redis Memcached

Monday April 23 12

but were hosted on a single machine

somewhere in LA

Monday April 23 12

Monday April 23 12

less powerful than my MacBook Pro

Monday April 23 12

okay we launchednow what

Monday April 23 12

25k signups in the first day

Monday April 23 12

everything is on fire

Monday April 23 12

best amp worst day of our lives so far

Monday April 23 12

load was through the roof

Monday April 23 12

friday rolls around

Monday April 23 12

not slowing down

Monday April 23 12

letrsquos move to EC2

Monday April 23 12

Monday April 23 12

PG upgrade to 90

Monday April 23 12

Monday April 23 12

scaling = replacing all components of a car

while driving it at 100mph

Monday April 23 12

this is the story of how our usage of PG has

evolved

Monday April 23 12

Phase 1 All ORM all the time

Monday April 23 12

why pg at first postgis

Monday April 23 12

managepy syncdb

Monday April 23 12

ORM made it too easy to not really think through

primary keys

Monday April 23 12

pretty good for getting off the ground

Monday April 23 12

Mediaobjectsget(pk = 4)

Monday April 23 12

first version of our feed (pre-launch)

Monday April 23 12

friends = Relationshipsobjectsfilter(source_user = user)

recent_photos = Mediaobjectsfilter(user_id__in = friends)order_by(lsquo-pkrsquo)[020]

Monday April 23 12

main feed at launch

Monday April 23 12

Redis user 33 postsfriends = SMEMBERS followers33for user in friends LPUSH feedltuser_idgt ltmedia_idgt

for readingLRANGE feed4 0 20

Monday April 23 12

canonical data PGfeedslistssets Redis

object cache memcache

Monday April 23 12

post-launch

Monday April 23 12

moved db to its own machine

Monday April 23 12

at time largest table photo metadata

Monday April 23 12

ran master-slave from the beginning with

streaming replication

Monday April 23 12

backups stop the replica xfs_freeze drives and take EBS snapshot

Monday April 23 12

AWS tradeoff

Monday April 23 12

3 early problems we hit with PG

Monday April 23 12

1 oh that setting was the problem

Monday April 23 12

work_mem

Monday April 23 12

shared_buffers

Monday April 23 12

cost_delay

Monday April 23 12

2 Django-specific ltidle in transactiongt

Monday April 23 12

3 connection pooling

Monday April 23 12

(we use PGBouncer)

Monday April 23 12

somewhere in this crazy couple of months

Christophe to the rescue

Monday April 23 12

photos kept growing and growing

Monday April 23 12

and only 68GB of RAM on biggest machine in EC2

Monday April 23 12

so what now

Monday April 23 12

Phase 2 Vertical Partitioning

Monday April 23 12

django db routers make it pretty easy

Monday April 23 12

def db_for_read(self model) if app_label == photos return photodb

Monday April 23 12

once you untangle all your foreign key

relationships

Monday April 23 12

(all of those useruser_id interchangeable calls bite

you now)

Monday April 23 12

plenty of time spent in PGFouine

Monday April 23 12

read slaves (using streaming replicas) where

we need to reduce contention

Monday April 23 12

a few months later

Monday April 23 12

photosdb gt 60GB

Monday April 23 12

precipitated by being on cloud hardware but likely to have hit limit eventually

either way

Monday April 23 12

what now

Monday April 23 12

horizontal partitioning

Monday April 23 12

Phase 3 sharding

Monday April 23 12

ldquosurely wersquoll have hired someone experienced before we actually need

to shardrdquo

Monday April 23 12

never true about scaling

Monday April 23 12

1 choosing a method2 adapting the application

3 expanding capacity

Monday April 23 12

evaluated solutions

Monday April 23 12

at the time none were up to task of being our

primary DB

Monday April 23 12

NoSQL alternatives

Monday April 23 12

Skypersquos sharding proxy

Monday April 23 12

Rangedate-based partitioning

Monday April 23 12

did in Postgres itself

Monday April 23 12

requirements

Monday April 23 12

1 low operational amp code complexity

Monday April 23 12

2 easy expanding of capacity

Monday April 23 12

3 low performance impact on application

Monday April 23 12

schema-based logical sharding

Monday April 23 12

many many many (thousands) of logical

shards

Monday April 23 12

that map to fewer physical ones

Monday April 23 12

8 logical shards on 2 machines

user_id 8 = logical shard

logical shards -gt physical shard map

0 A 1 A 2 A 3 A 4 B 5 B 6 B 7 B

Monday April 23 12

8 logical shards on 2 4 machines

user_id 8 = logical shard

logical shards -gt physical shard map

0 A 1 A 2 C 3 C 4 B 5 B 6 D 7 D

Monday April 23 12

iexclschemas

Monday April 23 12

all that lsquopublicrsquo stuff Irsquod been glossing over for 2

years

Monday April 23 12

- database - schema - table - columns

Monday April 23 12

spun up set of machines

Monday April 23 12

using fabric created thousands of schemas

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineB shard4 photos_by_user shard5 photos_by_user shard6 photos_by_user shard7 photos_by_user

Monday April 23 12

(fabric or similar parallel task executor is

essential)

Monday April 23 12

application-side logic

Monday April 23 12

SHARD_TO_DB =

SHARD_TO_DB[0] = 0SHARD_TO_DB[1] = 0SHARD_TO_DB[2] = 0SHARD_TO_DB[3] = 0SHARD_TO_DB[4] = 1SHARD_TO_DB[5] = 1SHARD_TO_DB[6] = 1SHARD_TO_DB[7] = 1

Monday April 23 12

instead of Django ORM wrote really simple db

abstraction layer

Monday April 23 12

selectupdateinsertdelete

Monday April 23 12

select(fields table_name shard_key where_statement where_parameters)

Monday April 23 12

select(fields table_name shard_key where_statement where_parameters)

shard_key num_logical_shards = shard_id

Monday April 23 12

in most cases user_id for us

Monday April 23 12

custom Django test runner to createtear-down sharded DBs

Monday April 23 12

most queries involve visiting handful of shards

over one or two machines

Monday April 23 12

if mapping across shards on single DB UNION

ALL to aggregate

Monday April 23 12

clients to library pass in((shard_key id)

(shard_keyid)) etc

Monday April 23 12

library maps sub-selects to each shard and each

machine

Monday April 23 12

parallel execution (per-machine at least)

Monday April 23 12

-gt Append (cost=00097372 rows=100 width=12) (actual time=0290160035 rows=30 loops=1) -gt Limit (cost=00080624 rows=30 width=12) (actual time=0288159913 rows=14 loops=1) -gt Index Scan Backward using index on table (cost=0001865104 rows=694 width=12) (actual time=0286159885 rows=14 loops=1) -gt Limit (cost=0007115 rows=30 width=12) (actual time=00150018 rows=1 loops=1) -gt Index Scan using index on table (cost=00010199 rows=43 width=12) (actual time=00130014 rows=1 loops=1) (etc)

Monday April 23 12

eventually would be nice to parallelize across

machines

Monday April 23 12

next challenge unique IDs

Monday April 23 12

requirements

Monday April 23 12

1 should be time sortable without requiring

a lookup

Monday April 23 12

2 should be 64-bit

Monday April 23 12

3 low operational complexity

Monday April 23 12

surveyed the options

Monday April 23 12

ticket servers

Monday April 23 12

UUID

Monday April 23 12

twitter snowflake

Monday April 23 12

application-level IDs ala Mongo

Monday April 23 12

hey the db is already pretty good about

incrementing sequences

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

CREATE OR REPLACE FUNCTION insta5next_id(OUT result bigint) AS $$DECLARE our_epoch bigint = 1314220021721 seq_id bigint now_millis bigint shard_id int = 5BEGIN SELECT nextval(insta5table_id_seq) 1024 INTO seq_id

SELECT FLOOR(EXTRACT(EPOCH FROM clock_timestamp()) 1000) INTO now_millis result = (now_millis - our_epoch) ltlt 23 result = result | (shard_id ltlt 10) result = result | (seq_id)END$$ LANGUAGE PLPGSQL

Monday April 23 12

pulling shard ID from ID

shard_id = id ^ ((id gtgt 23) ltlt 23)timestamp = EPOCH + id gtgt 23

Monday April 23 12

pros guaranteed unique in 64-bits not much of a

CPU overhead

Monday April 23 12

cons large IDs from the get-go

Monday April 23 12

hundreds of millions of IDs generated with this

scheme no issues

Monday April 23 12

well what about ldquore-shardingrdquo

Monday April 23 12

first recourse pg_reorg

Monday April 23 12

rewrites tables in index order

Monday April 23 12

only requires brief locks for atomic table renames

Monday April 23 12

20+GB savings on some of our dbs

Monday April 23 12

Monday April 23 12

especially useful on EC2

Monday April 23 12

but sometimes you just have to reshard

Monday April 23 12

streaming replication to the rescue

Monday April 23 12

(btw repmgr is awesome)

Monday April 23 12

repmgr standby clone ltmastergt

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineArsquo shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineC shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

Monday April 23 12

PGBouncer abstracts moving DBs from the

app logic

Monday April 23 12

can do this as long as you have more logical shards than physical

ones

Monday April 23 12

beauty of schemas is that they are physically

different files

Monday April 23 12

(no IO hit when deleting no lsquoswiss cheesersquo)

Monday April 23 12

downside requires ~30 seconds of maintenance to roll out new schema

mapping

Monday April 23 12

(could be solved by having concept of ldquoread-

onlyrdquo mode for some DBs)

Monday April 23 12

not great for range-scans that would span across

shards

Monday April 23 12

latest project follow graph

Monday April 23 12

v1 simple DB table(source_id target_id

status)

Monday April 23 12

who do I followwho follows me

do I follow Xdoes X follow me

Monday April 23 12

DB was busy so we started storing parallel

version in Redis

Monday April 23 12

follow_all(300 item list)

Monday April 23 12

inconsistency

Monday April 23 12

extra logic

Monday April 23 12

so much extra logic

Monday April 23 12

exposing your support team to the idea of cache invalidation

Monday April 23 12

Monday April 23 12

redesign took a page from twitterrsquos book

Monday April 23 12

PG can handle tens of thousands of requests very light memcached

caching

Monday April 23 12

next steps

Monday April 23 12

isolating services to minimize open conns

Monday April 23 12

investigate physical hardware etc to reduce

need to re-shard

Monday April 23 12

Wrap up

Monday April 23 12

you donrsquot need to give up PGrsquos durability amp features to shard

Monday April 23 12

continue to let the DB do what the DB is great at

Monday April 23 12

ldquodonrsquot shard until you have tordquo

Monday April 23 12

(but donrsquot over-estimate how hard it will be either)

Monday April 23 12

scaled within constraints of the cloud

Monday April 23 12

PG success story

Monday April 23 12

(wersquore really excited about 92)

Monday April 23 12

thanks any qs

Monday April 23 12

Monday April 23 12

Monday April 23 12

Monday April 23 12

30+ million users in less than 2 years

Monday April 23 12

at its heart Postgres-driven

Monday April 23 12

a glimpse at how a startup with a small eng team scaled with PG

Monday April 23 12

a brief tangent

Monday April 23 12

the beginning

Monday April 23 12

Text

Monday April 23 12

2 product guys

Monday April 23 12

no real back-end experience

Monday April 23 12

(you should have seen my first time finding my

way around psql)

Monday April 23 12

analytics amp python meebo

Monday April 23 12

CouchDB

Monday April 23 12

CrimeDesk SF

Monday April 23 12

Monday April 23 12

early mix PG Redis Memcached

Monday April 23 12

but were hosted on a single machine

somewhere in LA

Monday April 23 12

Monday April 23 12

less powerful than my MacBook Pro

Monday April 23 12

okay we launchednow what

Monday April 23 12

25k signups in the first day

Monday April 23 12

everything is on fire

Monday April 23 12

best amp worst day of our lives so far

Monday April 23 12

load was through the roof

Monday April 23 12

friday rolls around

Monday April 23 12

not slowing down

Monday April 23 12

letrsquos move to EC2

Monday April 23 12

Monday April 23 12

PG upgrade to 90

Monday April 23 12

Monday April 23 12

scaling = replacing all components of a car

while driving it at 100mph

Monday April 23 12

this is the story of how our usage of PG has

evolved

Monday April 23 12

Phase 1 All ORM all the time

Monday April 23 12

why pg at first postgis

Monday April 23 12

managepy syncdb

Monday April 23 12

ORM made it too easy to not really think through

primary keys

Monday April 23 12

pretty good for getting off the ground

Monday April 23 12

Mediaobjectsget(pk = 4)

Monday April 23 12

first version of our feed (pre-launch)

Monday April 23 12

friends = Relationshipsobjectsfilter(source_user = user)

recent_photos = Mediaobjectsfilter(user_id__in = friends)order_by(lsquo-pkrsquo)[020]

Monday April 23 12

main feed at launch

Monday April 23 12

Redis user 33 postsfriends = SMEMBERS followers33for user in friends LPUSH feedltuser_idgt ltmedia_idgt

for readingLRANGE feed4 0 20

Monday April 23 12

canonical data PGfeedslistssets Redis

object cache memcache

Monday April 23 12

post-launch

Monday April 23 12

moved db to its own machine

Monday April 23 12

at time largest table photo metadata

Monday April 23 12

ran master-slave from the beginning with

streaming replication

Monday April 23 12

backups stop the replica xfs_freeze drives and take EBS snapshot

Monday April 23 12

AWS tradeoff

Monday April 23 12

3 early problems we hit with PG

Monday April 23 12

1 oh that setting was the problem

Monday April 23 12

work_mem

Monday April 23 12

shared_buffers

Monday April 23 12

cost_delay

Monday April 23 12

2 Django-specific ltidle in transactiongt

Monday April 23 12

3 connection pooling

Monday April 23 12

(we use PGBouncer)

Monday April 23 12

somewhere in this crazy couple of months

Christophe to the rescue

Monday April 23 12

photos kept growing and growing

Monday April 23 12

and only 68GB of RAM on biggest machine in EC2

Monday April 23 12

so what now

Monday April 23 12

Phase 2 Vertical Partitioning

Monday April 23 12

django db routers make it pretty easy

Monday April 23 12

def db_for_read(self model) if app_label == photos return photodb

Monday April 23 12

once you untangle all your foreign key

relationships

Monday April 23 12

(all of those useruser_id interchangeable calls bite

you now)

Monday April 23 12

plenty of time spent in PGFouine

Monday April 23 12

read slaves (using streaming replicas) where

we need to reduce contention

Monday April 23 12

a few months later

Monday April 23 12

photosdb gt 60GB

Monday April 23 12

precipitated by being on cloud hardware but likely to have hit limit eventually

either way

Monday April 23 12

what now

Monday April 23 12

horizontal partitioning

Monday April 23 12

Phase 3 sharding

Monday April 23 12

ldquosurely wersquoll have hired someone experienced before we actually need

to shardrdquo

Monday April 23 12

never true about scaling

Monday April 23 12

1 choosing a method2 adapting the application

3 expanding capacity

Monday April 23 12

evaluated solutions

Monday April 23 12

at the time none were up to task of being our

primary DB

Monday April 23 12

NoSQL alternatives

Monday April 23 12

Skypersquos sharding proxy

Monday April 23 12

Rangedate-based partitioning

Monday April 23 12

did in Postgres itself

Monday April 23 12

requirements

Monday April 23 12

1 low operational amp code complexity

Monday April 23 12

2 easy expanding of capacity

Monday April 23 12

3 low performance impact on application

Monday April 23 12

schema-based logical sharding

Monday April 23 12

many many many (thousands) of logical

shards

Monday April 23 12

that map to fewer physical ones

Monday April 23 12

8 logical shards on 2 machines

user_id 8 = logical shard

logical shards -gt physical shard map

0 A 1 A 2 A 3 A 4 B 5 B 6 B 7 B

Monday April 23 12

8 logical shards on 2 4 machines

user_id 8 = logical shard

logical shards -gt physical shard map

0 A 1 A 2 C 3 C 4 B 5 B 6 D 7 D

Monday April 23 12

iexclschemas

Monday April 23 12

all that lsquopublicrsquo stuff Irsquod been glossing over for 2

years

Monday April 23 12

- database - schema - table - columns

Monday April 23 12

spun up set of machines

Monday April 23 12

using fabric created thousands of schemas

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineB shard4 photos_by_user shard5 photos_by_user shard6 photos_by_user shard7 photos_by_user

Monday April 23 12

(fabric or similar parallel task executor is

essential)

Monday April 23 12

application-side logic

Monday April 23 12

SHARD_TO_DB =

SHARD_TO_DB[0] = 0SHARD_TO_DB[1] = 0SHARD_TO_DB[2] = 0SHARD_TO_DB[3] = 0SHARD_TO_DB[4] = 1SHARD_TO_DB[5] = 1SHARD_TO_DB[6] = 1SHARD_TO_DB[7] = 1

Monday April 23 12

instead of Django ORM wrote really simple db

abstraction layer

Monday April 23 12

selectupdateinsertdelete

Monday April 23 12

select(fields table_name shard_key where_statement where_parameters)

Monday April 23 12

select(fields table_name shard_key where_statement where_parameters)

shard_key num_logical_shards = shard_id

Monday April 23 12

in most cases user_id for us

Monday April 23 12

custom Django test runner to createtear-down sharded DBs

Monday April 23 12

most queries involve visiting handful of shards

over one or two machines

Monday April 23 12

if mapping across shards on single DB UNION

ALL to aggregate

Monday April 23 12

clients to library pass in((shard_key id)

(shard_keyid)) etc

Monday April 23 12

library maps sub-selects to each shard and each

machine

Monday April 23 12

parallel execution (per-machine at least)

Monday April 23 12

-gt Append (cost=00097372 rows=100 width=12) (actual time=0290160035 rows=30 loops=1) -gt Limit (cost=00080624 rows=30 width=12) (actual time=0288159913 rows=14 loops=1) -gt Index Scan Backward using index on table (cost=0001865104 rows=694 width=12) (actual time=0286159885 rows=14 loops=1) -gt Limit (cost=0007115 rows=30 width=12) (actual time=00150018 rows=1 loops=1) -gt Index Scan using index on table (cost=00010199 rows=43 width=12) (actual time=00130014 rows=1 loops=1) (etc)

Monday April 23 12

eventually would be nice to parallelize across

machines

Monday April 23 12

next challenge: unique IDs

Monday April 23 12

requirements

Monday April 23 12

1. should be time-sortable without requiring a lookup

Monday April 23 12

2. should be 64-bit

Monday April 23 12

3. low operational complexity

Monday April 23 12

surveyed the options

Monday April 23 12

ticket servers

Monday April 23 12

UUID

Monday April 23 12

twitter snowflake

Monday April 23 12

application-level IDs ala Mongo

Monday April 23 12

hey the db is already pretty good about

incrementing sequences

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12


CREATE OR REPLACE FUNCTION insta5.next_id(OUT result bigint) AS $$
DECLARE
    our_epoch bigint := 1314220021721;
    seq_id bigint;
    now_millis bigint;
    shard_id int := 5;
BEGIN
    SELECT nextval('insta5.table_id_seq') % 1024 INTO seq_id;
    SELECT FLOOR(EXTRACT(EPOCH FROM clock_timestamp()) * 1000) INTO now_millis;
    result := (now_millis - our_epoch) << 23;
    result := result | (shard_id << 10);
    result := result | (seq_id);
END;
$$ LANGUAGE PLPGSQL;

Monday April 23 12

pulling shard ID from ID

shard_id = id ^ ((id >> 23) << 23)
timestamp = EPOCH + (id >> 23)

Monday April 23 12
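As an illustrative round trip in Python (a sketch mirroring the function above; make_id and decode_id are hypothetical names, and EPOCH is the same custom epoch as our_epoch in the PL/pgSQL):

# Hypothetical mirror of the ID scheme: compose an ID, then decode it.
import time

EPOCH = 1314220021721  # custom epoch in millis, matching our_epoch above

def make_id(shard_id, seq_id):
    now_millis = int(time.time() * 1000)
    result = (now_millis - EPOCH) << 23   # 41 bits of time
    result |= (shard_id << 10)            # 13 bits of shard ID
    result |= (seq_id % 1024)             # 10 bits of per-shard sequence
    return result

def decode_id(photo_id):
    timestamp_millis = EPOCH + (photo_id >> 23)
    shard_id = (photo_id >> 10) & ((1 << 13) - 1)
    seq_id = photo_id & ((1 << 10) - 1)
    return timestamp_millis, shard_id, seq_id

photo_id = make_id(shard_id=5, seq_id=1)
print(decode_id(photo_id))  # (roughly the current time in millis, 5, 1)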

pros: guaranteed unique in 64 bits, not much of a CPU overhead

Monday April 23 12

cons: large IDs from the get-go

Monday April 23 12

hundreds of millions of IDs generated with this scheme, no issues

Monday April 23 12

well, what about "re-sharding"?

Monday April 23 12

first recourse: pg_reorg

Monday April 23 12

rewrites tables in index order

Monday April 23 12

only requires brief locks for atomic table renames

Monday April 23 12

20+GB savings on some of our dbs

Monday April 23 12

Monday April 23 12

especially useful on EC2

Monday April 23 12

but sometimes you just have to reshard

Monday April 23 12

streaming replication to the rescue

Monday April 23 12

(btw repmgr is awesome)

Monday April 23 12

repmgr standby clone <master>

Monday April 23 12

machineA: shard0.photos_by_user, shard1.photos_by_user, shard2.photos_by_user, shard3.photos_by_user

machineA': shard0.photos_by_user, shard1.photos_by_user, shard2.photos_by_user, shard3.photos_by_user

Monday April 23 12

machineA: shard0.photos_by_user, shard1.photos_by_user, shard2.photos_by_user, shard3.photos_by_user

machineC: shard0.photos_by_user, shard1.photos_by_user, shard2.photos_by_user, shard3.photos_by_user

Monday April 23 12

PGBouncer abstracts moving DBs from the

app logic

Monday April 23 12

can do this as long as you have more logical shards than physical

ones

Monday April 23 12

beauty of schemas is that they are physically

different files

Monday April 23 12

(no I/O hit when deleting, no 'swiss cheese')

Monday April 23 12

downside: requires ~30 seconds of maintenance to roll out new schema mapping

Monday April 23 12

(could be solved by having concept of "read-only" mode for some DBs)

Monday April 23 12

not great for range-scans that would span across

shards

Monday April 23 12

latest project: follow graph

Monday April 23 12

v1: simple DB table (source_id, target_id, status)

Monday April 23 12

who do I follow? who follows me?

do I follow X? does X follow me?

Monday April 23 12
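For illustration, those four questions map onto simple indexed lookups against that v1 table; this is a hedged sketch in which the table name "follows" and the status value are assumptions, while the column names come from the slide.

# Hypothetical sketch of the four follow-graph lookups against the v1 table.
WHO_DO_I_FOLLOW = "SELECT target_id FROM follows WHERE source_id = %s AND status = 'active'"
WHO_FOLLOWS_ME  = "SELECT source_id FROM follows WHERE target_id = %s AND status = 'active'"
EDGE_EXISTS     = "SELECT 1 FROM follows WHERE source_id = %s AND target_id = %s AND status = 'active'"

def do_i_follow(cur, me, them):
    cur.execute(EDGE_EXISTS, (me, them))
    return cur.fetchone() is not None

def does_x_follow_me(cur, me, them):
    cur.execute(EDGE_EXISTS, (them, me))  # same query, arguments swapped
    return cur.fetchone() is not None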

DB was busy so we started storing parallel

version in Redis

Monday April 23 12

follow_all(300-item list)

Monday April 23 12

inconsistency

Monday April 23 12

extra logic

Monday April 23 12

so much extra logic

Monday April 23 12

exposing your support team to the idea of cache invalidation

Monday April 23 12

Monday April 23 12

redesign took a page from twitter's book

Monday April 23 12

PG can handle tens of thousands of requests, very light memcached caching

Monday April 23 12
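As a hedged sketch of that "very light caching" pattern, using python-memcached and reusing the do_i_follow helper from the earlier sketch; the key format and TTL are assumptions, not the production values.

# Hypothetical cache-aside check for "do I follow X", backed by PG + memcached.
import memcache

mc = memcache.Client(['127.0.0.1:11211'])

def do_i_follow_cached(cur, me, them, ttl=30):
    key = 'follows:%d:%d' % (me, them)
    cached = mc.get(key)
    if cached is not None:
        return cached == '1'
    # fall through to Postgres, which comfortably serves these point lookups
    result = do_i_follow(cur, me, them)
    mc.set(key, '1' if result else '0', time=ttl)
    return result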

next steps

Monday April 23 12

isolating services to minimize open conns

Monday April 23 12

investigate physical hardware etc to reduce

need to re-shard

Monday April 23 12

Wrap up

Monday April 23 12

you don't need to give up PG's durability & features to shard

Monday April 23 12

continue to let the DB do what the DB is great at

Monday April 23 12

"don't shard until you have to"

Monday April 23 12

(but don't over-estimate how hard it will be either)

Monday April 23 12

scaled within constraints of the cloud

Monday April 23 12

PG success story

Monday April 23 12

(we're really excited about 9.2)

Monday April 23 12

thanks! any qs?

Monday April 23 12


Monday April 23 12

eventually would be nice to parallelize across

machines

Monday April 23 12

next challenge unique IDs

Monday April 23 12

requirements

Monday April 23 12

1 should be time sortable without requiring

a lookup

Monday April 23 12

2 should be 64-bit

Monday April 23 12

3 low operational complexity

Monday April 23 12

surveyed the options

Monday April 23 12

ticket servers

Monday April 23 12

UUID

Monday April 23 12

twitter snowflake

Monday April 23 12

application-level IDs ala Mongo

Monday April 23 12

hey the db is already pretty good about

incrementing sequences

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

CREATE OR REPLACE FUNCTION insta5next_id(OUT result bigint) AS $$DECLARE our_epoch bigint = 1314220021721 seq_id bigint now_millis bigint shard_id int = 5BEGIN SELECT nextval(insta5table_id_seq) 1024 INTO seq_id

SELECT FLOOR(EXTRACT(EPOCH FROM clock_timestamp()) 1000) INTO now_millis result = (now_millis - our_epoch) ltlt 23 result = result | (shard_id ltlt 10) result = result | (seq_id)END$$ LANGUAGE PLPGSQL

Monday April 23 12

pulling shard ID from ID

shard_id = id ^ ((id gtgt 23) ltlt 23)timestamp = EPOCH + id gtgt 23

Monday April 23 12

pros guaranteed unique in 64-bits not much of a

CPU overhead

Monday April 23 12

cons large IDs from the get-go

Monday April 23 12

hundreds of millions of IDs generated with this

scheme no issues

Monday April 23 12

well what about ldquore-shardingrdquo

Monday April 23 12

first recourse pg_reorg

Monday April 23 12

rewrites tables in index order

Monday April 23 12

only requires brief locks for atomic table renames

Monday April 23 12

20+GB savings on some of our dbs

Monday April 23 12

Monday April 23 12

especially useful on EC2

Monday April 23 12

but sometimes you just have to reshard

Monday April 23 12

streaming replication to the rescue

Monday April 23 12

(btw repmgr is awesome)

Monday April 23 12

repmgr standby clone ltmastergt

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineArsquo shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineC shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

Monday April 23 12

PGBouncer abstracts moving DBs from the

app logic

Monday April 23 12

can do this as long as you have more logical shards than physical

ones

Monday April 23 12

beauty of schemas is that they are physically

different files

Monday April 23 12

(no IO hit when deleting no lsquoswiss cheesersquo)

Monday April 23 12

downside requires ~30 seconds of maintenance to roll out new schema

mapping

Monday April 23 12

(could be solved by having concept of ldquoread-

onlyrdquo mode for some DBs)

Monday April 23 12

not great for range-scans that would span across

shards

Monday April 23 12

latest project follow graph

Monday April 23 12

v1 simple DB table(source_id target_id

status)

Monday April 23 12

who do I followwho follows me

do I follow Xdoes X follow me

Monday April 23 12

DB was busy so we started storing parallel

version in Redis

Monday April 23 12

follow_all(300 item list)

Monday April 23 12

inconsistency

Monday April 23 12

extra logic

Monday April 23 12

so much extra logic

Monday April 23 12

exposing your support team to the idea of cache invalidation

Monday April 23 12

Monday April 23 12

redesign took a page from twitterrsquos book

Monday April 23 12

PG can handle tens of thousands of requests very light memcached

caching

Monday April 23 12

next steps

Monday April 23 12

isolating services to minimize open conns

Monday April 23 12

investigate physical hardware etc to reduce

need to re-shard

Monday April 23 12

Wrap up

Monday April 23 12

you donrsquot need to give up PGrsquos durability amp features to shard

Monday April 23 12

continue to let the DB do what the DB is great at

Monday April 23 12

ldquodonrsquot shard until you have tordquo

Monday April 23 12

(but donrsquot over-estimate how hard it will be either)

Monday April 23 12

scaled within constraints of the cloud

Monday April 23 12

PG success story

Monday April 23 12

(wersquore really excited about 92)

Monday April 23 12

thanks any qs

Monday April 23 12

a brief tangent

Monday April 23 12

the beginning

Monday April 23 12

Text

Monday April 23 12

2 product guys

Monday April 23 12

no real back-end experience

Monday April 23 12

(you should have seen my first time finding my

way around psql)

Monday April 23 12

analytics amp python meebo

Monday April 23 12

CouchDB

Monday April 23 12

CrimeDesk SF

Monday April 23 12

Monday April 23 12

early mix PG Redis Memcached

Monday April 23 12

but were hosted on a single machine

somewhere in LA

Monday April 23 12

Monday April 23 12

less powerful than my MacBook Pro

Monday April 23 12

okay we launchednow what

Monday April 23 12

25k signups in the first day

Monday April 23 12

everything is on fire

Monday April 23 12

best amp worst day of our lives so far

Monday April 23 12

load was through the roof

Monday April 23 12

friday rolls around

Monday April 23 12

not slowing down

Monday April 23 12

letrsquos move to EC2

Monday April 23 12

Monday April 23 12

PG upgrade to 90

Monday April 23 12

Monday April 23 12

scaling = replacing all components of a car

while driving it at 100mph

Monday April 23 12

this is the story of how our usage of PG has

evolved

Monday April 23 12

Phase 1 All ORM all the time

Monday April 23 12

why pg at first postgis

Monday April 23 12

managepy syncdb

Monday April 23 12

ORM made it too easy to not really think through

primary keys

Monday April 23 12

pretty good for getting off the ground

Monday April 23 12

Mediaobjectsget(pk = 4)

Monday April 23 12

first version of our feed (pre-launch)

Monday April 23 12

friends = Relationshipsobjectsfilter(source_user = user)

recent_photos = Mediaobjectsfilter(user_id__in = friends)order_by(lsquo-pkrsquo)[020]

Monday April 23 12

main feed at launch

Monday April 23 12

Redis user 33 postsfriends = SMEMBERS followers33for user in friends LPUSH feedltuser_idgt ltmedia_idgt

for readingLRANGE feed4 0 20

Monday April 23 12

canonical data PGfeedslistssets Redis

object cache memcache

Monday April 23 12

post-launch

Monday April 23 12

moved db to its own machine

Monday April 23 12

at time largest table photo metadata

Monday April 23 12

ran master-slave from the beginning with

streaming replication

Monday April 23 12

backups stop the replica xfs_freeze drives and take EBS snapshot

Monday April 23 12

AWS tradeoff

Monday April 23 12

3 early problems we hit with PG

Monday April 23 12

1 oh that setting was the problem

Monday April 23 12

work_mem

Monday April 23 12

shared_buffers

Monday April 23 12

cost_delay

Monday April 23 12

2 Django-specific ltidle in transactiongt

Monday April 23 12

3 connection pooling

Monday April 23 12

(we use PGBouncer)

Monday April 23 12

somewhere in this crazy couple of months

Christophe to the rescue

Monday April 23 12

photos kept growing and growing

Monday April 23 12

and only 68GB of RAM on biggest machine in EC2

Monday April 23 12

so what now

Monday April 23 12

Phase 2 Vertical Partitioning

Monday April 23 12

django db routers make it pretty easy

Monday April 23 12

def db_for_read(self model) if app_label == photos return photodb

Monday April 23 12

once you untangle all your foreign key

relationships

Monday April 23 12

(all of those useruser_id interchangeable calls bite

you now)

Monday April 23 12

plenty of time spent in PGFouine

Monday April 23 12

read slaves (using streaming replicas) where

we need to reduce contention

Monday April 23 12

a few months later

Monday April 23 12

photosdb gt 60GB

Monday April 23 12

precipitated by being on cloud hardware but likely to have hit limit eventually

either way

Monday April 23 12

what now

Monday April 23 12

horizontal partitioning

Monday April 23 12

Phase 3 sharding

Monday April 23 12

ldquosurely wersquoll have hired someone experienced before we actually need

to shardrdquo

Monday April 23 12

never true about scaling

Monday April 23 12

1 choosing a method2 adapting the application

3 expanding capacity

Monday April 23 12

evaluated solutions

Monday April 23 12

at the time none were up to task of being our

primary DB

Monday April 23 12

NoSQL alternatives

Monday April 23 12

Skypersquos sharding proxy

Monday April 23 12

Rangedate-based partitioning

Monday April 23 12

did in Postgres itself

Monday April 23 12

requirements

Monday April 23 12

1 low operational amp code complexity

Monday April 23 12

2 easy expanding of capacity

Monday April 23 12

3 low performance impact on application

Monday April 23 12

schema-based logical sharding

Monday April 23 12

many many many (thousands) of logical

shards

Monday April 23 12

that map to fewer physical ones

Monday April 23 12

8 logical shards on 2 machines

user_id 8 = logical shard

logical shards -gt physical shard map

0 A 1 A 2 A 3 A 4 B 5 B 6 B 7 B

Monday April 23 12

8 logical shards on 2 4 machines

user_id 8 = logical shard

logical shards -gt physical shard map

0 A 1 A 2 C 3 C 4 B 5 B 6 D 7 D

Monday April 23 12

iexclschemas

Monday April 23 12

all that lsquopublicrsquo stuff Irsquod been glossing over for 2

years

Monday April 23 12

- database - schema - table - columns

Monday April 23 12

spun up set of machines

Monday April 23 12

using fabric created thousands of schemas

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineB shard4 photos_by_user shard5 photos_by_user shard6 photos_by_user shard7 photos_by_user

Monday April 23 12

(fabric or similar parallel task executor is

essential)

Monday April 23 12

application-side logic

Monday April 23 12

SHARD_TO_DB =

SHARD_TO_DB[0] = 0SHARD_TO_DB[1] = 0SHARD_TO_DB[2] = 0SHARD_TO_DB[3] = 0SHARD_TO_DB[4] = 1SHARD_TO_DB[5] = 1SHARD_TO_DB[6] = 1SHARD_TO_DB[7] = 1

Monday April 23 12

instead of Django ORM wrote really simple db

abstraction layer

Monday April 23 12

selectupdateinsertdelete

Monday April 23 12

select(fields table_name shard_key where_statement where_parameters)

Monday April 23 12

select(fields table_name shard_key where_statement where_parameters)

shard_key num_logical_shards = shard_id

Monday April 23 12

in most cases user_id for us

Monday April 23 12

custom Django test runner to createtear-down sharded DBs

Monday April 23 12

most queries involve visiting handful of shards

over one or two machines

Monday April 23 12

if mapping across shards on single DB UNION

ALL to aggregate

Monday April 23 12

clients to library pass in((shard_key id)

(shard_keyid)) etc

Monday April 23 12

library maps sub-selects to each shard and each

machine

Monday April 23 12

parallel execution (per-machine at least)

Monday April 23 12

-gt Append (cost=00097372 rows=100 width=12) (actual time=0290160035 rows=30 loops=1) -gt Limit (cost=00080624 rows=30 width=12) (actual time=0288159913 rows=14 loops=1) -gt Index Scan Backward using index on table (cost=0001865104 rows=694 width=12) (actual time=0286159885 rows=14 loops=1) -gt Limit (cost=0007115 rows=30 width=12) (actual time=00150018 rows=1 loops=1) -gt Index Scan using index on table (cost=00010199 rows=43 width=12) (actual time=00130014 rows=1 loops=1) (etc)

Monday April 23 12

eventually would be nice to parallelize across

machines

Monday April 23 12

next challenge unique IDs

Monday April 23 12

requirements

Monday April 23 12

1 should be time sortable without requiring

a lookup

Monday April 23 12

2 should be 64-bit

Monday April 23 12

3 low operational complexity

Monday April 23 12

surveyed the options

Monday April 23 12

ticket servers

Monday April 23 12

UUID

Monday April 23 12

twitter snowflake

Monday April 23 12

application-level IDs ala Mongo

Monday April 23 12

hey the db is already pretty good about

incrementing sequences

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

CREATE OR REPLACE FUNCTION insta5next_id(OUT result bigint) AS $$DECLARE our_epoch bigint = 1314220021721 seq_id bigint now_millis bigint shard_id int = 5BEGIN SELECT nextval(insta5table_id_seq) 1024 INTO seq_id

SELECT FLOOR(EXTRACT(EPOCH FROM clock_timestamp()) 1000) INTO now_millis result = (now_millis - our_epoch) ltlt 23 result = result | (shard_id ltlt 10) result = result | (seq_id)END$$ LANGUAGE PLPGSQL

Monday April 23 12

pulling shard ID from ID

shard_id = id ^ ((id gtgt 23) ltlt 23)timestamp = EPOCH + id gtgt 23

Monday April 23 12

pros guaranteed unique in 64-bits not much of a

CPU overhead

Monday April 23 12

cons large IDs from the get-go

Monday April 23 12

hundreds of millions of IDs generated with this

scheme no issues

Monday April 23 12

well what about ldquore-shardingrdquo

Monday April 23 12

first recourse pg_reorg

Monday April 23 12

rewrites tables in index order

Monday April 23 12

only requires brief locks for atomic table renames

Monday April 23 12

20+GB savings on some of our dbs

Monday April 23 12

Monday April 23 12

especially useful on EC2

Monday April 23 12

but sometimes you just have to reshard

Monday April 23 12

streaming replication to the rescue

Monday April 23 12

(btw repmgr is awesome)

Monday April 23 12

repmgr standby clone ltmastergt

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineArsquo shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineC shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

Monday April 23 12

PGBouncer abstracts moving DBs from the

app logic

Monday April 23 12

can do this as long as you have more logical shards than physical

ones

Monday April 23 12

beauty of schemas is that they are physically

different files

Monday April 23 12

(no IO hit when deleting no lsquoswiss cheesersquo)

Monday April 23 12

downside requires ~30 seconds of maintenance to roll out new schema

mapping

Monday April 23 12

(could be solved by having concept of ldquoread-

onlyrdquo mode for some DBs)

Monday April 23 12

not great for range-scans that would span across

shards

Monday April 23 12

latest project follow graph

Monday April 23 12

v1 simple DB table(source_id target_id

status)

Monday April 23 12

who do I followwho follows me

do I follow Xdoes X follow me

Monday April 23 12

DB was busy so we started storing parallel

version in Redis

Monday April 23 12

follow_all(300 item list)

Monday April 23 12

inconsistency

Monday April 23 12

extra logic

Monday April 23 12

so much extra logic

Monday April 23 12

exposing your support team to the idea of cache invalidation

Monday April 23 12

Monday April 23 12

redesign took a page from twitterrsquos book

Monday April 23 12

PG can handle tens of thousands of requests very light memcached

caching

Monday April 23 12

next steps

Monday April 23 12

isolating services to minimize open conns

Monday April 23 12

investigate physical hardware etc to reduce

need to re-shard

Monday April 23 12

Wrap up

Monday April 23 12

you donrsquot need to give up PGrsquos durability amp features to shard

Monday April 23 12

continue to let the DB do what the DB is great at

Monday April 23 12

ldquodonrsquot shard until you have tordquo

Monday April 23 12

(but donrsquot over-estimate how hard it will be either)

Monday April 23 12

scaled within constraints of the cloud

Monday April 23 12

PG success story

Monday April 23 12

(wersquore really excited about 92)

Monday April 23 12

thanks any qs

Monday April 23 12

the beginning

Monday April 23 12

Text

Monday April 23 12

2 product guys

Monday April 23 12

no real back-end experience

Monday April 23 12

(you should have seen my first time finding my

way around psql)

Monday April 23 12

analytics amp python meebo

Monday April 23 12

CouchDB

Monday April 23 12

CrimeDesk SF

Monday April 23 12

Monday April 23 12

early mix PG Redis Memcached

Monday April 23 12

but were hosted on a single machine

somewhere in LA

Monday April 23 12

Monday April 23 12

less powerful than my MacBook Pro

Monday April 23 12

okay we launchednow what

Monday April 23 12

25k signups in the first day

Monday April 23 12

everything is on fire

Monday April 23 12

best amp worst day of our lives so far

Monday April 23 12

load was through the roof

Monday April 23 12

friday rolls around

Monday April 23 12

not slowing down

Monday April 23 12

letrsquos move to EC2

Monday April 23 12

Monday April 23 12

PG upgrade to 90

Monday April 23 12

Monday April 23 12

scaling = replacing all components of a car

while driving it at 100mph

Monday April 23 12

this is the story of how our usage of PG has

evolved

Monday April 23 12

Phase 1 All ORM all the time

Monday April 23 12

why pg at first postgis

Monday April 23 12

managepy syncdb

Monday April 23 12

ORM made it too easy to not really think through

primary keys

Monday April 23 12

pretty good for getting off the ground

Monday April 23 12

Mediaobjectsget(pk = 4)

Monday April 23 12

first version of our feed (pre-launch)

Monday April 23 12

friends = Relationshipsobjectsfilter(source_user = user)

recent_photos = Mediaobjectsfilter(user_id__in = friends)order_by(lsquo-pkrsquo)[020]

Monday April 23 12

main feed at launch

Monday April 23 12

Redis user 33 postsfriends = SMEMBERS followers33for user in friends LPUSH feedltuser_idgt ltmedia_idgt

for readingLRANGE feed4 0 20

Monday April 23 12

canonical data PGfeedslistssets Redis

object cache memcache

Monday April 23 12

post-launch

Monday April 23 12

moved db to its own machine

Monday April 23 12

at time largest table photo metadata

Monday April 23 12

ran master-slave from the beginning with

streaming replication

Monday April 23 12

backups stop the replica xfs_freeze drives and take EBS snapshot

Monday April 23 12

AWS tradeoff

Monday April 23 12

3 early problems we hit with PG

Monday April 23 12

1 oh that setting was the problem

Monday April 23 12

work_mem

Monday April 23 12

shared_buffers

Monday April 23 12

cost_delay

Monday April 23 12

2 Django-specific ltidle in transactiongt

Monday April 23 12

3 connection pooling

Monday April 23 12

(we use PGBouncer)

Monday April 23 12

somewhere in this crazy couple of months

Christophe to the rescue

Monday April 23 12

photos kept growing and growing

Monday April 23 12

and only 68GB of RAM on biggest machine in EC2

Monday April 23 12

so what now

Monday April 23 12

Phase 2 Vertical Partitioning

Monday April 23 12

django db routers make it pretty easy

Monday April 23 12

def db_for_read(self model) if app_label == photos return photodb

Monday April 23 12

once you untangle all your foreign key

relationships

Monday April 23 12

(all of those useruser_id interchangeable calls bite

you now)

Monday April 23 12

plenty of time spent in PGFouine

Monday April 23 12

read slaves (using streaming replicas) where

we need to reduce contention

Monday April 23 12

a few months later

Monday April 23 12

photosdb gt 60GB

Monday April 23 12

precipitated by being on cloud hardware but likely to have hit limit eventually

either way

Monday April 23 12

what now

Monday April 23 12

horizontal partitioning

Monday April 23 12

Phase 3 sharding

Monday April 23 12

ldquosurely wersquoll have hired someone experienced before we actually need

to shardrdquo

Monday April 23 12

never true about scaling

Monday April 23 12

1 choosing a method2 adapting the application

3 expanding capacity

Monday April 23 12

evaluated solutions

Monday April 23 12

at the time none were up to task of being our

primary DB

Monday April 23 12

NoSQL alternatives

Monday April 23 12

Skypersquos sharding proxy

Monday April 23 12

Rangedate-based partitioning

Monday April 23 12

did in Postgres itself

Monday April 23 12

requirements

Monday April 23 12

1 low operational amp code complexity

Monday April 23 12

2 easy expanding of capacity

Monday April 23 12

3 low performance impact on application

Monday April 23 12

schema-based logical sharding

Monday April 23 12

many many many (thousands) of logical

shards

Monday April 23 12

that map to fewer physical ones

Monday April 23 12

8 logical shards on 2 machines

user_id 8 = logical shard

logical shards -gt physical shard map

0 A 1 A 2 A 3 A 4 B 5 B 6 B 7 B

Monday April 23 12

8 logical shards on 2 4 machines

user_id 8 = logical shard

logical shards -gt physical shard map

0 A 1 A 2 C 3 C 4 B 5 B 6 D 7 D

Monday April 23 12

iexclschemas

Monday April 23 12

all that lsquopublicrsquo stuff Irsquod been glossing over for 2

years

Monday April 23 12

- database - schema - table - columns

Monday April 23 12

spun up set of machines

Monday April 23 12

using fabric created thousands of schemas

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineB shard4 photos_by_user shard5 photos_by_user shard6 photos_by_user shard7 photos_by_user

Monday April 23 12

(fabric or similar parallel task executor is

essential)

Monday April 23 12

application-side logic

Monday April 23 12

SHARD_TO_DB =

SHARD_TO_DB[0] = 0SHARD_TO_DB[1] = 0SHARD_TO_DB[2] = 0SHARD_TO_DB[3] = 0SHARD_TO_DB[4] = 1SHARD_TO_DB[5] = 1SHARD_TO_DB[6] = 1SHARD_TO_DB[7] = 1

Monday April 23 12

instead of Django ORM wrote really simple db

abstraction layer

Monday April 23 12

selectupdateinsertdelete

Monday April 23 12

select(fields table_name shard_key where_statement where_parameters)

Monday April 23 12

select(fields table_name shard_key where_statement where_parameters)

shard_key num_logical_shards = shard_id

Monday April 23 12

in most cases user_id for us

Monday April 23 12

custom Django test runner to createtear-down sharded DBs

Monday April 23 12

most queries involve visiting handful of shards

over one or two machines

Monday April 23 12

if mapping across shards on single DB UNION

ALL to aggregate

Monday April 23 12

clients to library pass in((shard_key id)

(shard_keyid)) etc

Monday April 23 12

library maps sub-selects to each shard and each

machine

Monday April 23 12

parallel execution (per-machine at least)

Monday April 23 12

-gt Append (cost=00097372 rows=100 width=12) (actual time=0290160035 rows=30 loops=1) -gt Limit (cost=00080624 rows=30 width=12) (actual time=0288159913 rows=14 loops=1) -gt Index Scan Backward using index on table (cost=0001865104 rows=694 width=12) (actual time=0286159885 rows=14 loops=1) -gt Limit (cost=0007115 rows=30 width=12) (actual time=00150018 rows=1 loops=1) -gt Index Scan using index on table (cost=00010199 rows=43 width=12) (actual time=00130014 rows=1 loops=1) (etc)

Monday April 23 12

eventually would be nice to parallelize across

machines

Monday April 23 12

next challenge unique IDs

Monday April 23 12

requirements

Monday April 23 12

1 should be time sortable without requiring

a lookup

Monday April 23 12

2 should be 64-bit

Monday April 23 12

3 low operational complexity

Monday April 23 12

surveyed the options

Monday April 23 12

ticket servers

Monday April 23 12

UUID

Monday April 23 12

twitter snowflake

Monday April 23 12

application-level IDs ala Mongo

Monday April 23 12

hey the db is already pretty good about

incrementing sequences

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

CREATE OR REPLACE FUNCTION insta5next_id(OUT result bigint) AS $$DECLARE our_epoch bigint = 1314220021721 seq_id bigint now_millis bigint shard_id int = 5BEGIN SELECT nextval(insta5table_id_seq) 1024 INTO seq_id

SELECT FLOOR(EXTRACT(EPOCH FROM clock_timestamp()) 1000) INTO now_millis result = (now_millis - our_epoch) ltlt 23 result = result | (shard_id ltlt 10) result = result | (seq_id)END$$ LANGUAGE PLPGSQL

Monday April 23 12

pulling shard ID from ID

shard_id = id ^ ((id gtgt 23) ltlt 23)timestamp = EPOCH + id gtgt 23

Monday April 23 12

pros guaranteed unique in 64-bits not much of a

CPU overhead

Monday April 23 12

cons large IDs from the get-go

Monday April 23 12

hundreds of millions of IDs generated with this

scheme no issues

Monday April 23 12

well what about ldquore-shardingrdquo

Monday April 23 12

first recourse pg_reorg

Monday April 23 12

rewrites tables in index order

Monday April 23 12

only requires brief locks for atomic table renames

Monday April 23 12

20+GB savings on some of our dbs

Monday April 23 12

Monday April 23 12

especially useful on EC2

Monday April 23 12

but sometimes you just have to reshard

Monday April 23 12

streaming replication to the rescue

Monday April 23 12

(btw repmgr is awesome)

Monday April 23 12

repmgr standby clone ltmastergt

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineArsquo shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineC shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

Monday April 23 12

PGBouncer abstracts moving DBs from the

app logic

Monday April 23 12

can do this as long as you have more logical shards than physical

ones

Monday April 23 12

beauty of schemas is that they are physically

different files

Monday April 23 12

(no IO hit when deleting no lsquoswiss cheesersquo)

Monday April 23 12

downside requires ~30 seconds of maintenance to roll out new schema

mapping

Monday April 23 12

(could be solved by having concept of ldquoread-

onlyrdquo mode for some DBs)

Monday April 23 12

not great for range-scans that would span across

shards

Monday April 23 12

latest project follow graph

Monday April 23 12

v1 simple DB table(source_id target_id

status)

Monday April 23 12

who do I followwho follows me

do I follow Xdoes X follow me

Monday April 23 12

DB was busy so we started storing parallel

version in Redis

Monday April 23 12

follow_all(300 item list)

Monday April 23 12

inconsistency

Monday April 23 12

extra logic

Monday April 23 12

so much extra logic

Monday April 23 12

exposing your support team to the idea of cache invalidation

Monday April 23 12

Monday April 23 12

redesign took a page from twitterrsquos book

Monday April 23 12

PG can handle tens of thousands of requests very light memcached

caching

Monday April 23 12

next steps

Monday April 23 12

isolating services to minimize open conns

Monday April 23 12

investigate physical hardware etc to reduce

need to re-shard

Monday April 23 12

Wrap up

Monday April 23 12

you donrsquot need to give up PGrsquos durability amp features to shard

Monday April 23 12

continue to let the DB do what the DB is great at

Monday April 23 12

ldquodonrsquot shard until you have tordquo

Monday April 23 12

(but donrsquot over-estimate how hard it will be either)

Monday April 23 12

scaled within constraints of the cloud

Monday April 23 12

PG success story

Monday April 23 12

(wersquore really excited about 92)

Monday April 23 12

thanks any qs

Monday April 23 12

Text

Monday April 23 12

2 product guys

Monday April 23 12

no real back-end experience

Monday April 23 12

(you should have seen my first time finding my

way around psql)

Monday April 23 12

analytics amp python meebo

Monday April 23 12

CouchDB

Monday April 23 12

CrimeDesk SF

Monday April 23 12

Monday April 23 12

early mix PG Redis Memcached

Monday April 23 12

but were hosted on a single machine

somewhere in LA

Monday April 23 12

Monday April 23 12

less powerful than my MacBook Pro

Monday April 23 12

okay we launchednow what

Monday April 23 12

25k signups in the first day

Monday April 23 12

everything is on fire

Monday April 23 12

best amp worst day of our lives so far

Monday April 23 12

load was through the roof

Monday April 23 12

friday rolls around

Monday April 23 12

not slowing down

Monday April 23 12

letrsquos move to EC2

Monday April 23 12

Monday April 23 12

PG upgrade to 90

Monday April 23 12

Monday April 23 12

scaling = replacing all components of a car

while driving it at 100mph

Monday April 23 12

this is the story of how our usage of PG has

evolved

Monday April 23 12

Phase 1 All ORM all the time

Monday April 23 12

why pg at first postgis

Monday April 23 12

managepy syncdb

Monday April 23 12

ORM made it too easy to not really think through

primary keys

Monday April 23 12

pretty good for getting off the ground

Monday April 23 12

Mediaobjectsget(pk = 4)

Monday April 23 12

first version of our feed (pre-launch)

Monday April 23 12

friends = Relationshipsobjectsfilter(source_user = user)

recent_photos = Mediaobjectsfilter(user_id__in = friends)order_by(lsquo-pkrsquo)[020]

Monday April 23 12

main feed at launch

Monday April 23 12

Redis user 33 postsfriends = SMEMBERS followers33for user in friends LPUSH feedltuser_idgt ltmedia_idgt

for readingLRANGE feed4 0 20

Monday April 23 12

canonical data PGfeedslistssets Redis

object cache memcache

Monday April 23 12

post-launch

Monday April 23 12

moved db to its own machine

Monday April 23 12

at time largest table photo metadata

Monday April 23 12

ran master-slave from the beginning with

streaming replication

Monday April 23 12

backups stop the replica xfs_freeze drives and take EBS snapshot

Monday April 23 12

AWS tradeoff

Monday April 23 12

3 early problems we hit with PG

Monday April 23 12

1 oh that setting was the problem

Monday April 23 12

work_mem

Monday April 23 12

shared_buffers

Monday April 23 12

cost_delay

Monday April 23 12

2 Django-specific ltidle in transactiongt

Monday April 23 12

3 connection pooling

Monday April 23 12

(we use PGBouncer)

Monday April 23 12

somewhere in this crazy couple of months

Christophe to the rescue

Monday April 23 12

photos kept growing and growing

Monday April 23 12

and only 68GB of RAM on biggest machine in EC2

Monday April 23 12

so what now

Monday April 23 12

Phase 2 Vertical Partitioning

Monday April 23 12

django db routers make it pretty easy

Monday April 23 12

def db_for_read(self model) if app_label == photos return photodb

Monday April 23 12

once you untangle all your foreign key

relationships

Monday April 23 12

(all of those useruser_id interchangeable calls bite

you now)

Monday April 23 12

plenty of time spent in PGFouine

Monday April 23 12

read slaves (using streaming replicas) where

we need to reduce contention

Monday April 23 12

a few months later

Monday April 23 12

photosdb gt 60GB

Monday April 23 12

precipitated by being on cloud hardware but likely to have hit limit eventually

either way

Monday April 23 12

what now

Monday April 23 12

horizontal partitioning

Monday April 23 12

Phase 3 sharding

Monday April 23 12

ldquosurely wersquoll have hired someone experienced before we actually need

to shardrdquo

Monday April 23 12

never true about scaling

Monday April 23 12

1 choosing a method2 adapting the application

3 expanding capacity

Monday April 23 12

evaluated solutions

Monday April 23 12

at the time none were up to task of being our

primary DB

Monday April 23 12

NoSQL alternatives

Monday April 23 12

Skypersquos sharding proxy

Monday April 23 12

Rangedate-based partitioning

Monday April 23 12

did in Postgres itself

Monday April 23 12

requirements

Monday April 23 12

1 low operational amp code complexity

Monday April 23 12

2 easy expanding of capacity

Monday April 23 12

3 low performance impact on application

Monday April 23 12

schema-based logical sharding

Monday April 23 12

many many many (thousands) of logical

shards

Monday April 23 12

that map to fewer physical ones

Monday April 23 12

8 logical shards on 2 machines

user_id 8 = logical shard

logical shards -gt physical shard map

0 A 1 A 2 A 3 A 4 B 5 B 6 B 7 B

Monday April 23 12

8 logical shards on 2 4 machines

user_id 8 = logical shard

logical shards -gt physical shard map

0 A 1 A 2 C 3 C 4 B 5 B 6 D 7 D

Monday April 23 12

iexclschemas

Monday April 23 12

all that lsquopublicrsquo stuff Irsquod been glossing over for 2

years

Monday April 23 12

- database - schema - table - columns

Monday April 23 12

spun up set of machines

Monday April 23 12

using fabric created thousands of schemas

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineB shard4 photos_by_user shard5 photos_by_user shard6 photos_by_user shard7 photos_by_user

Monday April 23 12

(fabric or similar parallel task executor is

essential)

Monday April 23 12

application-side logic

Monday April 23 12

SHARD_TO_DB =

SHARD_TO_DB[0] = 0SHARD_TO_DB[1] = 0SHARD_TO_DB[2] = 0SHARD_TO_DB[3] = 0SHARD_TO_DB[4] = 1SHARD_TO_DB[5] = 1SHARD_TO_DB[6] = 1SHARD_TO_DB[7] = 1

Monday April 23 12

instead of Django ORM wrote really simple db

abstraction layer

Monday April 23 12

selectupdateinsertdelete

Monday April 23 12

select(fields table_name shard_key where_statement where_parameters)

Monday April 23 12

select(fields table_name shard_key where_statement where_parameters)

shard_key num_logical_shards = shard_id

Monday April 23 12

in most cases user_id for us

Monday April 23 12

custom Django test runner to createtear-down sharded DBs

Monday April 23 12

most queries involve visiting handful of shards

over one or two machines

Monday April 23 12

if mapping across shards on single DB UNION

ALL to aggregate

Monday April 23 12

clients to library pass in((shard_key id)

(shard_keyid)) etc

Monday April 23 12

library maps sub-selects to each shard and each

machine

Monday April 23 12

parallel execution (per-machine at least)

Monday April 23 12

-gt Append (cost=00097372 rows=100 width=12) (actual time=0290160035 rows=30 loops=1) -gt Limit (cost=00080624 rows=30 width=12) (actual time=0288159913 rows=14 loops=1) -gt Index Scan Backward using index on table (cost=0001865104 rows=694 width=12) (actual time=0286159885 rows=14 loops=1) -gt Limit (cost=0007115 rows=30 width=12) (actual time=00150018 rows=1 loops=1) -gt Index Scan using index on table (cost=00010199 rows=43 width=12) (actual time=00130014 rows=1 loops=1) (etc)

Monday April 23 12

eventually would be nice to parallelize across

machines

Monday April 23 12

next challenge unique IDs

Monday April 23 12

requirements

Monday April 23 12

1 should be time sortable without requiring

a lookup

Monday April 23 12

2 should be 64-bit

Monday April 23 12

3 low operational complexity

Monday April 23 12

surveyed the options

Monday April 23 12

ticket servers

Monday April 23 12

UUID

Monday April 23 12

twitter snowflake

Monday April 23 12

application-level IDs ala Mongo

Monday April 23 12

hey the db is already pretty good about

incrementing sequences

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

CREATE OR REPLACE FUNCTION insta5next_id(OUT result bigint) AS $$DECLARE our_epoch bigint = 1314220021721 seq_id bigint now_millis bigint shard_id int = 5BEGIN SELECT nextval(insta5table_id_seq) 1024 INTO seq_id

SELECT FLOOR(EXTRACT(EPOCH FROM clock_timestamp()) 1000) INTO now_millis result = (now_millis - our_epoch) ltlt 23 result = result | (shard_id ltlt 10) result = result | (seq_id)END$$ LANGUAGE PLPGSQL

Monday April 23 12

pulling shard ID from ID

shard_id = id ^ ((id gtgt 23) ltlt 23)timestamp = EPOCH + id gtgt 23

Monday April 23 12

pros guaranteed unique in 64-bits not much of a

CPU overhead

Monday April 23 12

cons large IDs from the get-go

Monday April 23 12

hundreds of millions of IDs generated with this

scheme no issues

Monday April 23 12

well what about ldquore-shardingrdquo

Monday April 23 12

first recourse pg_reorg

Monday April 23 12

rewrites tables in index order

Monday April 23 12

only requires brief locks for atomic table renames

Monday April 23 12

20+GB savings on some of our dbs

Monday April 23 12

Monday April 23 12

especially useful on EC2

Monday April 23 12

but sometimes you just have to reshard

Monday April 23 12

streaming replication to the rescue

Monday April 23 12

(btw repmgr is awesome)

Monday April 23 12

repmgr standby clone ltmastergt

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineArsquo shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineC shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

Monday April 23 12

PGBouncer abstracts moving DBs from the

app logic

Monday April 23 12

can do this as long as you have more logical shards than physical

ones

Monday April 23 12

beauty of schemas is that they are physically

different files

Monday April 23 12

(no IO hit when deleting no lsquoswiss cheesersquo)

Monday April 23 12

downside requires ~30 seconds of maintenance to roll out new schema

mapping

Monday April 23 12

(could be solved by having concept of ldquoread-

onlyrdquo mode for some DBs)

Monday April 23 12

not great for range-scans that would span across

shards

Monday April 23 12

latest project follow graph

Monday April 23 12

v1 simple DB table(source_id target_id

status)

Monday April 23 12

who do I followwho follows me

do I follow Xdoes X follow me

Monday April 23 12

DB was busy so we started storing parallel

version in Redis

Monday April 23 12

follow_all(300 item list)

Monday April 23 12

inconsistency

Monday April 23 12

extra logic

Monday April 23 12

so much extra logic

Monday April 23 12

exposing your support team to the idea of cache invalidation

Monday April 23 12

Monday April 23 12

redesign took a page from twitterrsquos book

Monday April 23 12

PG can handle tens of thousands of requests very light memcached

caching

Monday April 23 12

next steps

Monday April 23 12

isolating services to minimize open conns

Monday April 23 12

investigate physical hardware etc to reduce

need to re-shard

Monday April 23 12

Wrap up

Monday April 23 12

you donrsquot need to give up PGrsquos durability amp features to shard

Monday April 23 12

continue to let the DB do what the DB is great at

Monday April 23 12

ldquodonrsquot shard until you have tordquo

Monday April 23 12

(but donrsquot over-estimate how hard it will be either)

Monday April 23 12

scaled within constraints of the cloud

Monday April 23 12

PG success story

Monday April 23 12

(wersquore really excited about 92)

Monday April 23 12

thanks any qs

Monday April 23 12

2 product guys

Monday April 23 12

no real back-end experience

Monday April 23 12

(you should have seen my first time finding my

way around psql)

Monday April 23 12

analytics amp python meebo

Monday April 23 12

CouchDB

Monday April 23 12

CrimeDesk SF

Monday April 23 12

Monday April 23 12

early mix PG Redis Memcached

Monday April 23 12

but were hosted on a single machine

somewhere in LA

Monday April 23 12

Monday April 23 12

less powerful than my MacBook Pro

Monday April 23 12

okay we launchednow what

Monday April 23 12

25k signups in the first day

Monday April 23 12

everything is on fire

Monday April 23 12

best amp worst day of our lives so far

Monday April 23 12

load was through the roof

Monday April 23 12

friday rolls around

Monday April 23 12

not slowing down

Monday April 23 12

letrsquos move to EC2

Monday April 23 12

Monday April 23 12

PG upgrade to 90

Monday April 23 12

Monday April 23 12

scaling = replacing all components of a car

while driving it at 100mph

Monday April 23 12

this is the story of how our usage of PG has

evolved

Monday April 23 12

Phase 1 All ORM all the time

Monday April 23 12

why pg at first postgis

Monday April 23 12

managepy syncdb

Monday April 23 12

ORM made it too easy to not really think through

primary keys

Monday April 23 12

pretty good for getting off the ground

Monday April 23 12

Mediaobjectsget(pk = 4)

Monday April 23 12

first version of our feed (pre-launch)

Monday April 23 12

friends = Relationshipsobjectsfilter(source_user = user)

recent_photos = Mediaobjectsfilter(user_id__in = friends)order_by(lsquo-pkrsquo)[020]

Monday April 23 12

main feed at launch

Monday April 23 12

Redis user 33 postsfriends = SMEMBERS followers33for user in friends LPUSH feedltuser_idgt ltmedia_idgt

for readingLRANGE feed4 0 20

Monday April 23 12

canonical data PGfeedslistssets Redis

object cache memcache

Monday April 23 12

post-launch

Monday April 23 12

moved db to its own machine

Monday April 23 12

at time largest table photo metadata

Monday April 23 12

ran master-slave from the beginning with

streaming replication

Monday April 23 12

backups stop the replica xfs_freeze drives and take EBS snapshot

Monday April 23 12

AWS tradeoff

Monday April 23 12

3 early problems we hit with PG

Monday April 23 12

1 oh that setting was the problem

Monday April 23 12

work_mem

Monday April 23 12

shared_buffers

Monday April 23 12

cost_delay

Monday April 23 12

2 Django-specific ltidle in transactiongt

Monday April 23 12

3 connection pooling

Monday April 23 12

(we use PGBouncer)

Monday April 23 12

somewhere in this crazy couple of months

Christophe to the rescue

Monday April 23 12

photos kept growing and growing

Monday April 23 12

and only 68GB of RAM on biggest machine in EC2

Monday April 23 12

so what now

Monday April 23 12

Phase 2 Vertical Partitioning

Monday April 23 12

django db routers make it pretty easy

Monday April 23 12

def db_for_read(self model) if app_label == photos return photodb

Monday April 23 12

once you untangle all your foreign key

relationships

Monday April 23 12

(all of those useruser_id interchangeable calls bite

you now)

Monday April 23 12

plenty of time spent in PGFouine

Monday April 23 12

read slaves (using streaming replicas) where

we need to reduce contention

Monday April 23 12

a few months later

Monday April 23 12

photosdb gt 60GB

Monday April 23 12

precipitated by being on cloud hardware but likely to have hit limit eventually

either way

Monday April 23 12

what now

Monday April 23 12

horizontal partitioning

Monday April 23 12

Phase 3 sharding

Monday April 23 12

ldquosurely wersquoll have hired someone experienced before we actually need

to shardrdquo

Monday April 23 12

never true about scaling

Monday April 23 12

1 choosing a method2 adapting the application

3 expanding capacity

Monday April 23 12

evaluated solutions

Monday April 23 12

at the time none were up to task of being our

primary DB

Monday April 23 12

NoSQL alternatives

Monday April 23 12

Skypersquos sharding proxy

Monday April 23 12

Rangedate-based partitioning

Monday April 23 12

did in Postgres itself

Monday April 23 12

requirements

Monday April 23 12

1 low operational amp code complexity

Monday April 23 12

2 easy expanding of capacity

Monday April 23 12

3 low performance impact on application

Monday April 23 12

schema-based logical sharding

Monday April 23 12

many many many (thousands) of logical

shards

Monday April 23 12

that map to fewer physical ones

Monday April 23 12

8 logical shards on 2 machines

user_id 8 = logical shard

logical shards -gt physical shard map

0 A 1 A 2 A 3 A 4 B 5 B 6 B 7 B

Monday April 23 12

8 logical shards on 2 4 machines

user_id 8 = logical shard

logical shards -gt physical shard map

0 A 1 A 2 C 3 C 4 B 5 B 6 D 7 D

Monday April 23 12

iexclschemas

Monday April 23 12

all that lsquopublicrsquo stuff Irsquod been glossing over for 2

years

Monday April 23 12

- database - schema - table - columns

Monday April 23 12

spun up set of machines

Monday April 23 12

using fabric created thousands of schemas

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineB shard4 photos_by_user shard5 photos_by_user shard6 photos_by_user shard7 photos_by_user

Monday April 23 12

(fabric or similar parallel task executor is

essential)

Monday April 23 12

application-side logic

Monday April 23 12

SHARD_TO_DB =

SHARD_TO_DB[0] = 0SHARD_TO_DB[1] = 0SHARD_TO_DB[2] = 0SHARD_TO_DB[3] = 0SHARD_TO_DB[4] = 1SHARD_TO_DB[5] = 1SHARD_TO_DB[6] = 1SHARD_TO_DB[7] = 1

Monday April 23 12

instead of Django ORM wrote really simple db

abstraction layer

Monday April 23 12

selectupdateinsertdelete

Monday April 23 12

select(fields table_name shard_key where_statement where_parameters)

Monday April 23 12

select(fields table_name shard_key where_statement where_parameters)

shard_key num_logical_shards = shard_id

Monday April 23 12

in most cases user_id for us

Monday April 23 12

custom Django test runner to createtear-down sharded DBs

Monday April 23 12

most queries involve visiting handful of shards

over one or two machines

Monday April 23 12

if mapping across shards on single DB UNION

ALL to aggregate

Monday April 23 12

clients to library pass in((shard_key id)

(shard_keyid)) etc

Monday April 23 12

library maps sub-selects to each shard and each

machine

Monday April 23 12

parallel execution (per-machine at least)

Monday April 23 12

-gt Append (cost=00097372 rows=100 width=12) (actual time=0290160035 rows=30 loops=1) -gt Limit (cost=00080624 rows=30 width=12) (actual time=0288159913 rows=14 loops=1) -gt Index Scan Backward using index on table (cost=0001865104 rows=694 width=12) (actual time=0286159885 rows=14 loops=1) -gt Limit (cost=0007115 rows=30 width=12) (actual time=00150018 rows=1 loops=1) -gt Index Scan using index on table (cost=00010199 rows=43 width=12) (actual time=00130014 rows=1 loops=1) (etc)

Monday April 23 12

eventually would be nice to parallelize across

machines

Monday April 23 12

next challenge unique IDs

Monday April 23 12

requirements

Monday April 23 12

1 should be time sortable without requiring

a lookup

Monday April 23 12

2 should be 64-bit

Monday April 23 12

3 low operational complexity

Monday April 23 12

surveyed the options

Monday April 23 12

ticket servers

Monday April 23 12

UUID

Monday April 23 12

twitter snowflake

Monday April 23 12

application-level IDs ala Mongo

Monday April 23 12

hey the db is already pretty good about

incrementing sequences

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

CREATE OR REPLACE FUNCTION insta5next_id(OUT result bigint) AS $$DECLARE our_epoch bigint = 1314220021721 seq_id bigint now_millis bigint shard_id int = 5BEGIN SELECT nextval(insta5table_id_seq) 1024 INTO seq_id

SELECT FLOOR(EXTRACT(EPOCH FROM clock_timestamp()) 1000) INTO now_millis result = (now_millis - our_epoch) ltlt 23 result = result | (shard_id ltlt 10) result = result | (seq_id)END$$ LANGUAGE PLPGSQL

Monday April 23 12

pulling shard ID from ID

shard_id = id ^ ((id gtgt 23) ltlt 23)timestamp = EPOCH + id gtgt 23

Monday April 23 12

pros guaranteed unique in 64-bits not much of a

CPU overhead

Monday April 23 12

cons large IDs from the get-go

Monday April 23 12

hundreds of millions of IDs generated with this

scheme no issues

Monday April 23 12

well what about ldquore-shardingrdquo

Monday April 23 12

first recourse pg_reorg

Monday April 23 12

rewrites tables in index order

Monday April 23 12

only requires brief locks for atomic table renames

Monday April 23 12

20+GB savings on some of our dbs

Monday April 23 12

Monday April 23 12

especially useful on EC2

Monday April 23 12

but sometimes you just have to reshard

Monday April 23 12

streaming replication to the rescue

Monday April 23 12

(btw repmgr is awesome)

Monday April 23 12

repmgr standby clone ltmastergt

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineArsquo shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineC shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

Monday April 23 12

PGBouncer abstracts moving DBs from the

app logic

Monday April 23 12

can do this as long as you have more logical shards than physical

ones

Monday April 23 12

beauty of schemas is that they are physically

different files

Monday April 23 12

(no IO hit when deleting no lsquoswiss cheesersquo)

Monday April 23 12

downside requires ~30 seconds of maintenance to roll out new schema

mapping

Monday April 23 12

(could be solved by having concept of ldquoread-

onlyrdquo mode for some DBs)

Monday April 23 12

not great for range-scans that would span across

shards

Monday April 23 12

latest project follow graph

Monday April 23 12

v1 simple DB table(source_id target_id

status)

Monday April 23 12

who do I followwho follows me

do I follow Xdoes X follow me

Monday April 23 12

DB was busy so we started storing parallel

version in Redis

Monday April 23 12

follow_all(300 item list)

Monday April 23 12

inconsistency

Monday April 23 12

extra logic

Monday April 23 12

so much extra logic

Monday April 23 12

exposing your support team to the idea of cache invalidation

Monday April 23 12

Monday April 23 12

redesign took a page from twitterrsquos book

Monday April 23 12

PG can handle tens of thousands of requests very light memcached

caching

Monday April 23 12

next steps

Monday April 23 12

isolating services to minimize open conns

Monday April 23 12

investigate physical hardware etc to reduce

need to re-shard

Monday April 23 12

Wrap up

Monday April 23 12

you donrsquot need to give up PGrsquos durability amp features to shard

Monday April 23 12

continue to let the DB do what the DB is great at

Monday April 23 12

ldquodonrsquot shard until you have tordquo

Monday April 23 12

(but donrsquot over-estimate how hard it will be either)

Monday April 23 12

scaled within constraints of the cloud

Monday April 23 12

PG success story

Monday April 23 12

(wersquore really excited about 92)

Monday April 23 12

thanks any qs

Monday April 23 12

no real back-end experience

Monday April 23 12

(you should have seen my first time finding my

way around psql)

Monday April 23 12

analytics amp python meebo

Monday April 23 12

CouchDB

Monday April 23 12

CrimeDesk SF

Monday April 23 12

Monday April 23 12

early mix PG Redis Memcached

Monday April 23 12

but were hosted on a single machine

somewhere in LA

Monday April 23 12

Monday April 23 12

less powerful than my MacBook Pro

Monday April 23 12

okay we launchednow what

Monday April 23 12

25k signups in the first day

Monday April 23 12

everything is on fire

Monday April 23 12

best amp worst day of our lives so far

Monday April 23 12

load was through the roof

Monday April 23 12

friday rolls around

Monday April 23 12

not slowing down

Monday April 23 12

letrsquos move to EC2

Monday April 23 12

Monday April 23 12

PG upgrade to 90

Monday April 23 12

Monday April 23 12

scaling = replacing all components of a car

while driving it at 100mph

Monday April 23 12

this is the story of how our usage of PG has

evolved

Monday April 23 12

Phase 1 All ORM all the time

Monday April 23 12

why pg at first postgis

Monday April 23 12

managepy syncdb

Monday April 23 12

ORM made it too easy to not really think through

primary keys

Monday April 23 12

pretty good for getting off the ground

Monday April 23 12

Mediaobjectsget(pk = 4)

Monday April 23 12

first version of our feed (pre-launch)

Monday April 23 12

friends = Relationshipsobjectsfilter(source_user = user)

recent_photos = Mediaobjectsfilter(user_id__in = friends)order_by(lsquo-pkrsquo)[020]

Monday April 23 12

main feed at launch

Monday April 23 12

Redis user 33 postsfriends = SMEMBERS followers33for user in friends LPUSH feedltuser_idgt ltmedia_idgt

for readingLRANGE feed4 0 20

Monday April 23 12

canonical data PGfeedslistssets Redis

object cache memcache

Monday April 23 12

post-launch

Monday April 23 12

moved db to its own machine

Monday April 23 12

at time largest table photo metadata

Monday April 23 12

ran master-slave from the beginning with

streaming replication

Monday April 23 12

backups stop the replica xfs_freeze drives and take EBS snapshot

Monday April 23 12

AWS tradeoff

Monday April 23 12

3 early problems we hit with PG

Monday April 23 12

1 oh that setting was the problem

Monday April 23 12

work_mem

Monday April 23 12

shared_buffers

Monday April 23 12

cost_delay

Monday April 23 12

2 Django-specific ltidle in transactiongt

Monday April 23 12

3 connection pooling

Monday April 23 12

(we use PGBouncer)

Monday April 23 12

somewhere in this crazy couple of months

Christophe to the rescue

Monday April 23 12

photos kept growing and growing

Monday April 23 12

and only 68GB of RAM on biggest machine in EC2

Monday April 23 12

so what now

Monday April 23 12

Phase 2 Vertical Partitioning

Monday April 23 12

django db routers make it pretty easy

Monday April 23 12

def db_for_read(self model) if app_label == photos return photodb

Monday April 23 12

once you untangle all your foreign key

relationships

Monday April 23 12

(all of those useruser_id interchangeable calls bite

you now)

Monday April 23 12

plenty of time spent in PGFouine

Monday April 23 12

read slaves (using streaming replicas) where

we need to reduce contention

Monday April 23 12

a few months later

Monday April 23 12

photosdb gt 60GB

Monday April 23 12

precipitated by being on cloud hardware but likely to have hit limit eventually

either way

Monday April 23 12

what now

Monday April 23 12

horizontal partitioning

Monday April 23 12

Phase 3 sharding

Monday April 23 12

ldquosurely wersquoll have hired someone experienced before we actually need

to shardrdquo

Monday April 23 12

never true about scaling

Monday April 23 12

1 choosing a method2 adapting the application

3 expanding capacity

Monday April 23 12

evaluated solutions

Monday April 23 12

at the time none were up to task of being our

primary DB

Monday April 23 12

NoSQL alternatives

Monday April 23 12

Skypersquos sharding proxy

Monday April 23 12

Rangedate-based partitioning

Monday April 23 12

did in Postgres itself

Monday April 23 12

requirements

Monday April 23 12

1 low operational amp code complexity

Monday April 23 12

2 easy expanding of capacity

Monday April 23 12

3 low performance impact on application

Monday April 23 12

schema-based logical sharding

Monday April 23 12

many many many (thousands) of logical

shards

Monday April 23 12

that map to fewer physical ones

Monday April 23 12

8 logical shards on 2 machines

user_id 8 = logical shard

logical shards -gt physical shard map

0 A 1 A 2 A 3 A 4 B 5 B 6 B 7 B

Monday April 23 12

8 logical shards on 2 4 machines

user_id 8 = logical shard

logical shards -gt physical shard map

0 A 1 A 2 C 3 C 4 B 5 B 6 D 7 D

Monday April 23 12

iexclschemas

Monday April 23 12

all that lsquopublicrsquo stuff Irsquod been glossing over for 2

years

Monday April 23 12

- database - schema - table - columns

Monday April 23 12

spun up set of machines

Monday April 23 12

using fabric created thousands of schemas

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineB shard4 photos_by_user shard5 photos_by_user shard6 photos_by_user shard7 photos_by_user

Monday April 23 12

(fabric or similar parallel task executor is

essential)

Monday April 23 12

application-side logic

Monday April 23 12

SHARD_TO_DB =

SHARD_TO_DB[0] = 0SHARD_TO_DB[1] = 0SHARD_TO_DB[2] = 0SHARD_TO_DB[3] = 0SHARD_TO_DB[4] = 1SHARD_TO_DB[5] = 1SHARD_TO_DB[6] = 1SHARD_TO_DB[7] = 1

Monday April 23 12

instead of Django ORM wrote really simple db

abstraction layer

Monday April 23 12

selectupdateinsertdelete

Monday April 23 12

select(fields table_name shard_key where_statement where_parameters)

Monday April 23 12

select(fields table_name shard_key where_statement where_parameters)

shard_key num_logical_shards = shard_id

Monday April 23 12

in most cases user_id for us

Monday April 23 12

custom Django test runner to createtear-down sharded DBs

Monday April 23 12

most queries involve visiting handful of shards

over one or two machines

Monday April 23 12

if mapping across shards on single DB UNION

ALL to aggregate

Monday April 23 12

clients to library pass in((shard_key id)

(shard_keyid)) etc

Monday April 23 12

library maps sub-selects to each shard and each

machine

Monday April 23 12

parallel execution (per-machine at least)

Monday April 23 12

-gt Append (cost=00097372 rows=100 width=12) (actual time=0290160035 rows=30 loops=1) -gt Limit (cost=00080624 rows=30 width=12) (actual time=0288159913 rows=14 loops=1) -gt Index Scan Backward using index on table (cost=0001865104 rows=694 width=12) (actual time=0286159885 rows=14 loops=1) -gt Limit (cost=0007115 rows=30 width=12) (actual time=00150018 rows=1 loops=1) -gt Index Scan using index on table (cost=00010199 rows=43 width=12) (actual time=00130014 rows=1 loops=1) (etc)

Monday April 23 12

eventually would be nice to parallelize across

machines

Monday April 23 12

next challenge unique IDs

Monday April 23 12

requirements

Monday April 23 12

1 should be time sortable without requiring

a lookup

Monday April 23 12

2 should be 64-bit

Monday April 23 12

3 low operational complexity

Monday April 23 12

surveyed the options

Monday April 23 12

ticket servers

Monday April 23 12

UUID

Monday April 23 12

twitter snowflake

Monday April 23 12

application-level IDs ala Mongo

Monday April 23 12

hey the db is already pretty good about

incrementing sequences

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

CREATE OR REPLACE FUNCTION insta5next_id(OUT result bigint) AS $$DECLARE our_epoch bigint = 1314220021721 seq_id bigint now_millis bigint shard_id int = 5BEGIN SELECT nextval(insta5table_id_seq) 1024 INTO seq_id

SELECT FLOOR(EXTRACT(EPOCH FROM clock_timestamp()) 1000) INTO now_millis result = (now_millis - our_epoch) ltlt 23 result = result | (shard_id ltlt 10) result = result | (seq_id)END$$ LANGUAGE PLPGSQL

Monday April 23 12

pulling shard ID from ID

shard_id = id ^ ((id gtgt 23) ltlt 23)timestamp = EPOCH + id gtgt 23

Monday April 23 12

pros guaranteed unique in 64-bits not much of a

CPU overhead

Monday April 23 12

cons large IDs from the get-go

Monday April 23 12

hundreds of millions of IDs generated with this

scheme no issues

Monday April 23 12

well what about ldquore-shardingrdquo

Monday April 23 12

first recourse pg_reorg

Monday April 23 12

rewrites tables in index order

Monday April 23 12

only requires brief locks for atomic table renames

Monday April 23 12

20+GB savings on some of our dbs

Monday April 23 12

Monday April 23 12

especially useful on EC2

Monday April 23 12

but sometimes you just have to reshard

Monday April 23 12

streaming replication to the rescue

Monday April 23 12

(btw repmgr is awesome)

Monday April 23 12

repmgr standby clone ltmastergt

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineArsquo shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineC shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

Monday April 23 12

PGBouncer abstracts moving DBs from the

app logic

Monday April 23 12

can do this as long as you have more logical shards than physical

ones

Monday April 23 12

beauty of schemas is that they are physically

different files

Monday April 23 12

(no IO hit when deleting no lsquoswiss cheesersquo)

Monday April 23 12

downside requires ~30 seconds of maintenance to roll out new schema

mapping

Monday April 23 12

(could be solved by having concept of ldquoread-

onlyrdquo mode for some DBs)

Monday April 23 12

not great for range-scans that would span across

shards

Monday April 23 12

latest project follow graph

Monday April 23 12

v1 simple DB table(source_id target_id

status)

Monday April 23 12

who do I followwho follows me

do I follow Xdoes X follow me

Monday April 23 12

DB was busy so we started storing parallel

version in Redis

Monday April 23 12

follow_all(300 item list)

Monday April 23 12

inconsistency

Monday April 23 12

extra logic

Monday April 23 12

so much extra logic

Monday April 23 12

exposing your support team to the idea of cache invalidation

Monday April 23 12

Monday April 23 12

redesign took a page from twitterrsquos book

Monday April 23 12

PG can handle tens of thousands of requests very light memcached

caching

Monday April 23 12

next steps

Monday April 23 12

isolating services to minimize open conns

Monday April 23 12

investigate physical hardware etc to reduce

need to re-shard

Monday April 23 12

Wrap up

Monday April 23 12

you donrsquot need to give up PGrsquos durability amp features to shard

Monday April 23 12

continue to let the DB do what the DB is great at

Monday April 23 12

ldquodonrsquot shard until you have tordquo

Monday April 23 12

(but donrsquot over-estimate how hard it will be either)

Monday April 23 12

scaled within constraints of the cloud

Monday April 23 12

PG success story

Monday April 23 12

(wersquore really excited about 92)

Monday April 23 12

thanks any qs

Monday April 23 12

(you should have seen my first time finding my

way around psql)

Monday April 23 12

analytics amp python meebo

Monday April 23 12

CouchDB

Monday April 23 12

CrimeDesk SF

Monday April 23 12

Monday April 23 12

early mix PG Redis Memcached

Monday April 23 12

but were hosted on a single machine

somewhere in LA

Monday April 23 12

Monday April 23 12

less powerful than my MacBook Pro

Monday April 23 12

okay we launchednow what

Monday April 23 12

25k signups in the first day

Monday April 23 12

everything is on fire

Monday April 23 12

best amp worst day of our lives so far

Monday April 23 12

load was through the roof

Monday April 23 12

friday rolls around

Monday April 23 12

not slowing down

Monday April 23 12

letrsquos move to EC2

Monday April 23 12

Monday April 23 12

PG upgrade to 90

Monday April 23 12

Monday April 23 12

scaling = replacing all components of a car

while driving it at 100mph

Monday April 23 12

this is the story of how our usage of PG has

evolved

Monday April 23 12

Phase 1 All ORM all the time

Monday April 23 12

why pg at first postgis

Monday April 23 12

managepy syncdb

Monday April 23 12

ORM made it too easy to not really think through

primary keys

Monday April 23 12

pretty good for getting off the ground

Monday April 23 12

Mediaobjectsget(pk = 4)

Monday April 23 12

first version of our feed (pre-launch)

Monday April 23 12

friends = Relationshipsobjectsfilter(source_user = user)

recent_photos = Mediaobjectsfilter(user_id__in = friends)order_by(lsquo-pkrsquo)[020]

Monday April 23 12

main feed at launch

Monday April 23 12

Redis user 33 postsfriends = SMEMBERS followers33for user in friends LPUSH feedltuser_idgt ltmedia_idgt

for readingLRANGE feed4 0 20

Monday April 23 12

canonical data PGfeedslistssets Redis

object cache memcache

Monday April 23 12

post-launch

Monday April 23 12

moved db to its own machine

Monday April 23 12

at time largest table photo metadata

Monday April 23 12

ran master-slave from the beginning with

streaming replication

Monday April 23 12

backups stop the replica xfs_freeze drives and take EBS snapshot

Monday April 23 12

AWS tradeoff

Monday April 23 12

3 early problems we hit with PG

Monday April 23 12

1 oh that setting was the problem

Monday April 23 12

work_mem

Monday April 23 12

shared_buffers

Monday April 23 12

cost_delay

Monday April 23 12

2 Django-specific ltidle in transactiongt

Monday April 23 12

3 connection pooling

Monday April 23 12

(we use PGBouncer)

Monday April 23 12

somewhere in this crazy couple of months

Christophe to the rescue

Monday April 23 12

photos kept growing and growing

Monday April 23 12

and only 68GB of RAM on biggest machine in EC2

Monday April 23 12

so what now

Monday April 23 12

Phase 2 Vertical Partitioning

Monday April 23 12

django db routers make it pretty easy

Monday April 23 12

def db_for_read(self model) if app_label == photos return photodb

Monday April 23 12

once you untangle all your foreign key

relationships

Monday April 23 12

(all of those useruser_id interchangeable calls bite

you now)

Monday April 23 12

plenty of time spent in PGFouine

Monday April 23 12

read slaves (using streaming replicas) where

we need to reduce contention

Monday April 23 12

a few months later

Monday April 23 12

photosdb gt 60GB

Monday April 23 12

precipitated by being on cloud hardware but likely to have hit limit eventually

either way

Monday April 23 12

what now

Monday April 23 12

horizontal partitioning

Monday April 23 12

Phase 3 sharding

Monday April 23 12

ldquosurely wersquoll have hired someone experienced before we actually need

to shardrdquo

Monday April 23 12

never true about scaling

Monday April 23 12

1 choosing a method2 adapting the application

3 expanding capacity

Monday April 23 12

evaluated solutions

Monday April 23 12

at the time none were up to task of being our

primary DB

Monday April 23 12

NoSQL alternatives

Monday April 23 12

Skypersquos sharding proxy

Monday April 23 12

Rangedate-based partitioning

Monday April 23 12

did in Postgres itself

Monday April 23 12

requirements

Monday April 23 12

1 low operational amp code complexity

Monday April 23 12

2 easy expanding of capacity

Monday April 23 12

3 low performance impact on application

Monday April 23 12

schema-based logical sharding

Monday April 23 12

many many many (thousands) of logical

shards

Monday April 23 12

that map to fewer physical ones

Monday April 23 12

8 logical shards on 2 machines

user_id 8 = logical shard

logical shards -gt physical shard map

0 A 1 A 2 A 3 A 4 B 5 B 6 B 7 B

Monday April 23 12

8 logical shards on 2 4 machines

user_id 8 = logical shard

logical shards -gt physical shard map

0 A 1 A 2 C 3 C 4 B 5 B 6 D 7 D

Monday April 23 12

iexclschemas

Monday April 23 12

all that lsquopublicrsquo stuff Irsquod been glossing over for 2

years

Monday April 23 12

- database - schema - table - columns

Monday April 23 12

spun up set of machines

Monday April 23 12

using fabric created thousands of schemas

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineB shard4 photos_by_user shard5 photos_by_user shard6 photos_by_user shard7 photos_by_user

Monday April 23 12

(fabric or similar parallel task executor is

essential)

Monday April 23 12

application-side logic

Monday April 23 12

SHARD_TO_DB =

SHARD_TO_DB[0] = 0SHARD_TO_DB[1] = 0SHARD_TO_DB[2] = 0SHARD_TO_DB[3] = 0SHARD_TO_DB[4] = 1SHARD_TO_DB[5] = 1SHARD_TO_DB[6] = 1SHARD_TO_DB[7] = 1

Monday April 23 12

instead of Django ORM wrote really simple db

abstraction layer

Monday April 23 12

selectupdateinsertdelete

Monday April 23 12

select(fields table_name shard_key where_statement where_parameters)

Monday April 23 12

select(fields table_name shard_key where_statement where_parameters)

shard_key num_logical_shards = shard_id

Monday April 23 12

in most cases user_id for us

Monday April 23 12

custom Django test runner to createtear-down sharded DBs

Monday April 23 12

most queries involve visiting handful of shards

over one or two machines

Monday April 23 12

if mapping across shards on single DB UNION

ALL to aggregate

Monday April 23 12

clients to library pass in((shard_key id)

(shard_keyid)) etc

Monday April 23 12

library maps sub-selects to each shard and each

machine

Monday April 23 12

parallel execution (per-machine at least)

Monday April 23 12

-gt Append (cost=00097372 rows=100 width=12) (actual time=0290160035 rows=30 loops=1) -gt Limit (cost=00080624 rows=30 width=12) (actual time=0288159913 rows=14 loops=1) -gt Index Scan Backward using index on table (cost=0001865104 rows=694 width=12) (actual time=0286159885 rows=14 loops=1) -gt Limit (cost=0007115 rows=30 width=12) (actual time=00150018 rows=1 loops=1) -gt Index Scan using index on table (cost=00010199 rows=43 width=12) (actual time=00130014 rows=1 loops=1) (etc)

Monday April 23 12

eventually would be nice to parallelize across

machines

Monday April 23 12

next challenge unique IDs

Monday April 23 12

requirements

Monday April 23 12

1 should be time sortable without requiring

a lookup

Monday April 23 12

2 should be 64-bit

Monday April 23 12

3 low operational complexity

Monday April 23 12

surveyed the options

Monday April 23 12

ticket servers

Monday April 23 12

UUID

Monday April 23 12

twitter snowflake

Monday April 23 12

application-level IDs ala Mongo

Monday April 23 12

hey the db is already pretty good about

incrementing sequences

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

CREATE OR REPLACE FUNCTION insta5next_id(OUT result bigint) AS $$DECLARE our_epoch bigint = 1314220021721 seq_id bigint now_millis bigint shard_id int = 5BEGIN SELECT nextval(insta5table_id_seq) 1024 INTO seq_id

SELECT FLOOR(EXTRACT(EPOCH FROM clock_timestamp()) 1000) INTO now_millis result = (now_millis - our_epoch) ltlt 23 result = result | (shard_id ltlt 10) result = result | (seq_id)END$$ LANGUAGE PLPGSQL

Monday April 23 12

pulling shard ID from ID

shard_id = id ^ ((id gtgt 23) ltlt 23)timestamp = EPOCH + id gtgt 23

Monday April 23 12

pros guaranteed unique in 64-bits not much of a

CPU overhead

Monday April 23 12

cons large IDs from the get-go

Monday April 23 12

hundreds of millions of IDs generated with this

scheme no issues

Monday April 23 12

well what about ldquore-shardingrdquo

Monday April 23 12

first recourse pg_reorg

Monday April 23 12

rewrites tables in index order

Monday April 23 12

only requires brief locks for atomic table renames

Monday April 23 12

20+GB savings on some of our dbs

Monday April 23 12

Monday April 23 12

especially useful on EC2

Monday April 23 12

but sometimes you just have to reshard

Monday April 23 12

streaming replication to the rescue

Monday April 23 12

(btw repmgr is awesome)

Monday April 23 12

repmgr standby clone ltmastergt

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineArsquo shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineC shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

Monday April 23 12

PGBouncer abstracts moving DBs from the

app logic

Monday April 23 12

can do this as long as you have more logical shards than physical

ones

Monday April 23 12

beauty of schemas is that they are physically

different files

Monday April 23 12

(no IO hit when deleting no lsquoswiss cheesersquo)

Monday April 23 12

downside requires ~30 seconds of maintenance to roll out new schema

mapping

Monday April 23 12

(could be solved by having concept of ldquoread-

onlyrdquo mode for some DBs)

Monday April 23 12

not great for range-scans that would span across

shards

Monday April 23 12

latest project follow graph

Monday April 23 12

v1 simple DB table(source_id target_id

status)

Monday April 23 12

who do I followwho follows me

do I follow Xdoes X follow me

Monday April 23 12

DB was busy so we started storing parallel

version in Redis

Monday April 23 12

follow_all(300 item list)

Monday April 23 12

inconsistency

Monday April 23 12

extra logic

Monday April 23 12

so much extra logic

Monday April 23 12

exposing your support team to the idea of cache invalidation

Monday April 23 12

Monday April 23 12

redesign took a page from twitterrsquos book

Monday April 23 12

PG can handle tens of thousands of requests very light memcached

caching

Monday April 23 12

next steps

Monday April 23 12

isolating services to minimize open conns

Monday April 23 12

investigate physical hardware etc to reduce

need to re-shard

Monday April 23 12

Wrap up

Monday April 23 12

you donrsquot need to give up PGrsquos durability amp features to shard

Monday April 23 12

continue to let the DB do what the DB is great at

Monday April 23 12

ldquodonrsquot shard until you have tordquo

Monday April 23 12

(but donrsquot over-estimate how hard it will be either)

Monday April 23 12

scaled within constraints of the cloud

Monday April 23 12

PG success story

Monday April 23 12

(wersquore really excited about 92)

Monday April 23 12

thanks any qs

Monday April 23 12

analytics amp python meebo

Monday April 23 12

CouchDB

Monday April 23 12

CrimeDesk SF

Monday April 23 12

Monday April 23 12

early mix PG Redis Memcached

Monday April 23 12

but were hosted on a single machine

somewhere in LA

Monday April 23 12

Monday April 23 12

less powerful than my MacBook Pro

Monday April 23 12

okay we launchednow what

Monday April 23 12

25k signups in the first day

Monday April 23 12

everything is on fire

Monday April 23 12

best amp worst day of our lives so far

Monday April 23 12

load was through the roof

Monday April 23 12

friday rolls around

Monday April 23 12

not slowing down

Monday April 23 12

letrsquos move to EC2

Monday April 23 12

Monday April 23 12

PG upgrade to 90

Monday April 23 12

Monday April 23 12

scaling = replacing all components of a car

while driving it at 100mph

Monday April 23 12

this is the story of how our usage of PG has

evolved

Monday April 23 12

Phase 1 All ORM all the time

Monday April 23 12

why pg at first postgis

Monday April 23 12

managepy syncdb

Monday April 23 12

ORM made it too easy to not really think through

primary keys

Monday April 23 12

pretty good for getting off the ground

Monday April 23 12

Mediaobjectsget(pk = 4)

Monday April 23 12

first version of our feed (pre-launch)

Monday April 23 12

friends = Relationshipsobjectsfilter(source_user = user)

recent_photos = Mediaobjectsfilter(user_id__in = friends)order_by(lsquo-pkrsquo)[020]

Monday April 23 12

main feed at launch

Monday April 23 12

Redis user 33 postsfriends = SMEMBERS followers33for user in friends LPUSH feedltuser_idgt ltmedia_idgt

for readingLRANGE feed4 0 20

Monday April 23 12

canonical data PGfeedslistssets Redis

object cache memcache

Monday April 23 12

post-launch

Monday April 23 12

moved db to its own machine

Monday April 23 12

at time largest table photo metadata

Monday April 23 12

ran master-slave from the beginning with

streaming replication

Monday April 23 12

backups stop the replica xfs_freeze drives and take EBS snapshot

Monday April 23 12

AWS tradeoff

Monday April 23 12

3 early problems we hit with PG

Monday April 23 12

1 oh that setting was the problem

Monday April 23 12

work_mem

Monday April 23 12

shared_buffers

Monday April 23 12

cost_delay

Monday April 23 12

2 Django-specific ltidle in transactiongt

Monday April 23 12

3 connection pooling

Monday April 23 12

(we use PGBouncer)

Monday April 23 12

somewhere in this crazy couple of months

Christophe to the rescue

Monday April 23 12

photos kept growing and growing

Monday April 23 12

and only 68GB of RAM on biggest machine in EC2

Monday April 23 12

so what now

Monday April 23 12

Phase 2 Vertical Partitioning

Monday April 23 12

django db routers make it pretty easy

Monday April 23 12

def db_for_read(self model) if app_label == photos return photodb

Monday April 23 12

once you untangle all your foreign key

relationships

Monday April 23 12

(all of those useruser_id interchangeable calls bite

you now)

Monday April 23 12

plenty of time spent in PGFouine

Monday April 23 12

read slaves (using streaming replicas) where

we need to reduce contention

Monday April 23 12

a few months later

Monday April 23 12

photosdb gt 60GB

Monday April 23 12

precipitated by being on cloud hardware but likely to have hit limit eventually

either way

Monday April 23 12

what now

Monday April 23 12

horizontal partitioning

Monday April 23 12

Phase 3 sharding

Monday April 23 12

ldquosurely wersquoll have hired someone experienced before we actually need

to shardrdquo

Monday April 23 12

never true about scaling

Monday April 23 12

1 choosing a method2 adapting the application

3 expanding capacity

Monday April 23 12

evaluated solutions

Monday April 23 12

at the time none were up to task of being our

primary DB

Monday April 23 12

NoSQL alternatives

Monday April 23 12

Skypersquos sharding proxy

Monday April 23 12

Rangedate-based partitioning

Monday April 23 12

did in Postgres itself

Monday April 23 12

requirements

Monday April 23 12

1 low operational amp code complexity

Monday April 23 12

2 easy expanding of capacity

Monday April 23 12

3 low performance impact on application

Monday April 23 12

schema-based logical sharding

Monday April 23 12

many many many (thousands) of logical

shards

Monday April 23 12

that map to fewer physical ones

Monday April 23 12

8 logical shards on 2 machines

user_id 8 = logical shard

logical shards -gt physical shard map

0 A 1 A 2 A 3 A 4 B 5 B 6 B 7 B

Monday April 23 12

8 logical shards on 2 4 machines

user_id 8 = logical shard

logical shards -gt physical shard map

0 A 1 A 2 C 3 C 4 B 5 B 6 D 7 D

Monday April 23 12

iexclschemas

Monday April 23 12

all that lsquopublicrsquo stuff Irsquod been glossing over for 2

years

Monday April 23 12

- database - schema - table - columns

Monday April 23 12

spun up set of machines

Monday April 23 12

using fabric created thousands of schemas

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineB shard4 photos_by_user shard5 photos_by_user shard6 photos_by_user shard7 photos_by_user

Monday April 23 12

(fabric or similar parallel task executor is

essential)

Monday April 23 12

application-side logic

Monday April 23 12

SHARD_TO_DB =

SHARD_TO_DB[0] = 0SHARD_TO_DB[1] = 0SHARD_TO_DB[2] = 0SHARD_TO_DB[3] = 0SHARD_TO_DB[4] = 1SHARD_TO_DB[5] = 1SHARD_TO_DB[6] = 1SHARD_TO_DB[7] = 1

Monday April 23 12

instead of Django ORM wrote really simple db

abstraction layer

Monday April 23 12

selectupdateinsertdelete

Monday April 23 12

select(fields table_name shard_key where_statement where_parameters)

Monday April 23 12

select(fields table_name shard_key where_statement where_parameters)

shard_key num_logical_shards = shard_id

Monday April 23 12

in most cases user_id for us

Monday April 23 12

custom Django test runner to createtear-down sharded DBs

Monday April 23 12

most queries involve visiting handful of shards

over one or two machines

Monday April 23 12

if mapping across shards on single DB UNION

ALL to aggregate

Monday April 23 12

clients to library pass in((shard_key id)

(shard_keyid)) etc

Monday April 23 12

library maps sub-selects to each shard and each

machine

Monday April 23 12

parallel execution (per-machine at least)

Monday April 23 12

-gt Append (cost=00097372 rows=100 width=12) (actual time=0290160035 rows=30 loops=1) -gt Limit (cost=00080624 rows=30 width=12) (actual time=0288159913 rows=14 loops=1) -gt Index Scan Backward using index on table (cost=0001865104 rows=694 width=12) (actual time=0286159885 rows=14 loops=1) -gt Limit (cost=0007115 rows=30 width=12) (actual time=00150018 rows=1 loops=1) -gt Index Scan using index on table (cost=00010199 rows=43 width=12) (actual time=00130014 rows=1 loops=1) (etc)

Monday April 23 12

eventually would be nice to parallelize across

machines

Monday April 23 12

next challenge unique IDs

Monday April 23 12

requirements

Monday April 23 12

1 should be time sortable without requiring

a lookup

Monday April 23 12

2 should be 64-bit

Monday April 23 12

3 low operational complexity

Monday April 23 12

surveyed the options

Monday April 23 12

ticket servers

Monday April 23 12

UUID

Monday April 23 12

twitter snowflake

Monday April 23 12

application-level IDs ala Mongo

Monday April 23 12

hey the db is already pretty good about

incrementing sequences

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

CREATE OR REPLACE FUNCTION insta5next_id(OUT result bigint) AS $$DECLARE our_epoch bigint = 1314220021721 seq_id bigint now_millis bigint shard_id int = 5BEGIN SELECT nextval(insta5table_id_seq) 1024 INTO seq_id

SELECT FLOOR(EXTRACT(EPOCH FROM clock_timestamp()) 1000) INTO now_millis result = (now_millis - our_epoch) ltlt 23 result = result | (shard_id ltlt 10) result = result | (seq_id)END$$ LANGUAGE PLPGSQL

Monday April 23 12

pulling shard ID from ID

shard_id = id ^ ((id gtgt 23) ltlt 23)timestamp = EPOCH + id gtgt 23

Monday April 23 12

pros guaranteed unique in 64-bits not much of a

CPU overhead

Monday April 23 12

cons large IDs from the get-go

Monday April 23 12

hundreds of millions of IDs generated with this

scheme no issues

Monday April 23 12

well what about ldquore-shardingrdquo

Monday April 23 12

first recourse pg_reorg

Monday April 23 12

rewrites tables in index order

Monday April 23 12

only requires brief locks for atomic table renames

Monday April 23 12

20+GB savings on some of our dbs

Monday April 23 12

Monday April 23 12

especially useful on EC2

Monday April 23 12

but sometimes you just have to reshard

Monday April 23 12

streaming replication to the rescue

Monday April 23 12

(btw repmgr is awesome)

Monday April 23 12

repmgr standby clone ltmastergt

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineArsquo shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineC shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

Monday April 23 12

PGBouncer abstracts moving DBs from the

app logic

Monday April 23 12

can do this as long as you have more logical shards than physical

ones

Monday April 23 12

beauty of schemas is that they are physically

different files

Monday April 23 12

(no IO hit when deleting no lsquoswiss cheesersquo)

Monday April 23 12

downside requires ~30 seconds of maintenance to roll out new schema

mapping

Monday April 23 12

(could be solved by having concept of ldquoread-

onlyrdquo mode for some DBs)

Monday April 23 12

not great for range-scans that would span across

shards

Monday April 23 12

latest project follow graph

Monday April 23 12

v1 simple DB table(source_id target_id

status)

Monday April 23 12

who do I followwho follows me

do I follow Xdoes X follow me

Monday April 23 12

DB was busy so we started storing parallel

version in Redis

Monday April 23 12

follow_all(300 item list)

Monday April 23 12

inconsistency

Monday April 23 12

extra logic

Monday April 23 12

so much extra logic

Monday April 23 12

exposing your support team to the idea of cache invalidation

Monday April 23 12

Monday April 23 12

redesign took a page from twitterrsquos book

Monday April 23 12

PG can handle tens of thousands of requests very light memcached

caching

Monday April 23 12

next steps

Monday April 23 12

isolating services to minimize open conns

Monday April 23 12

investigate physical hardware etc to reduce

need to re-shard

Monday April 23 12

Wrap up

Monday April 23 12

you donrsquot need to give up PGrsquos durability amp features to shard

Monday April 23 12

continue to let the DB do what the DB is great at

Monday April 23 12

ldquodonrsquot shard until you have tordquo

Monday April 23 12

(but donrsquot over-estimate how hard it will be either)

Monday April 23 12

scaled within constraints of the cloud

Monday April 23 12

PG success story

Monday April 23 12

(wersquore really excited about 92)

Monday April 23 12

thanks any qs

Monday April 23 12

CouchDB

Monday April 23 12

CrimeDesk SF

Monday April 23 12

Monday April 23 12

early mix PG Redis Memcached

Monday April 23 12

but were hosted on a single machine

somewhere in LA

Monday April 23 12

Monday April 23 12

less powerful than my MacBook Pro

Monday April 23 12

okay we launchednow what

Monday April 23 12

25k signups in the first day

Monday April 23 12

everything is on fire

Monday April 23 12

best amp worst day of our lives so far

Monday April 23 12

load was through the roof

Monday April 23 12

friday rolls around

Monday April 23 12

not slowing down

Monday April 23 12

letrsquos move to EC2

Monday April 23 12

Monday April 23 12

PG upgrade to 90

Monday April 23 12

Monday April 23 12

scaling = replacing all components of a car

while driving it at 100mph

Monday April 23 12

this is the story of how our usage of PG has

evolved

Monday April 23 12

Phase 1 All ORM all the time

Monday April 23 12

why pg at first postgis

Monday April 23 12

managepy syncdb

Monday April 23 12

ORM made it too easy to not really think through

primary keys

Monday April 23 12

pretty good for getting off the ground

Monday April 23 12

Mediaobjectsget(pk = 4)

Monday April 23 12

first version of our feed (pre-launch)

Monday April 23 12

friends = Relationshipsobjectsfilter(source_user = user)

recent_photos = Mediaobjectsfilter(user_id__in = friends)order_by(lsquo-pkrsquo)[020]

Monday April 23 12

main feed at launch

Monday April 23 12

Redis user 33 postsfriends = SMEMBERS followers33for user in friends LPUSH feedltuser_idgt ltmedia_idgt

for readingLRANGE feed4 0 20

Monday April 23 12

canonical data PGfeedslistssets Redis

object cache memcache

Monday April 23 12

post-launch

Monday April 23 12

moved db to its own machine

Monday April 23 12

at time largest table photo metadata

Monday April 23 12

ran master-slave from the beginning with

streaming replication

Monday April 23 12

backups stop the replica xfs_freeze drives and take EBS snapshot

Monday April 23 12

AWS tradeoff

Monday April 23 12

3 early problems we hit with PG

Monday April 23 12

1 oh that setting was the problem

Monday April 23 12

work_mem

Monday April 23 12

shared_buffers

Monday April 23 12

cost_delay

Monday April 23 12

2 Django-specific ltidle in transactiongt

Monday April 23 12

3 connection pooling

Monday April 23 12

(we use PGBouncer)

Monday April 23 12

somewhere in this crazy couple of months

Christophe to the rescue

Monday April 23 12

photos kept growing and growing

Monday April 23 12

and only 68GB of RAM on biggest machine in EC2

Monday April 23 12

so what now

Monday April 23 12

Phase 2 Vertical Partitioning

Monday April 23 12

django db routers make it pretty easy

Monday April 23 12

def db_for_read(self model) if app_label == photos return photodb

Monday April 23 12

once you untangle all your foreign key

relationships

Monday April 23 12

(all of those useruser_id interchangeable calls bite

you now)

Monday April 23 12

plenty of time spent in PGFouine

Monday April 23 12

read slaves (using streaming replicas) where

we need to reduce contention

Monday April 23 12

a few months later

Monday April 23 12

photosdb gt 60GB

Monday April 23 12

precipitated by being on cloud hardware but likely to have hit limit eventually

either way

Monday April 23 12

what now

Monday April 23 12

horizontal partitioning

Monday April 23 12

Phase 3 sharding

Monday April 23 12

ldquosurely wersquoll have hired someone experienced before we actually need

to shardrdquo

Monday April 23 12

never true about scaling

Monday April 23 12

1 choosing a method2 adapting the application

3 expanding capacity

Monday April 23 12

evaluated solutions

Monday April 23 12

at the time none were up to task of being our

primary DB

Monday April 23 12

NoSQL alternatives

Monday April 23 12

Skypersquos sharding proxy

Monday April 23 12

Rangedate-based partitioning

Monday April 23 12

did in Postgres itself

Monday April 23 12

requirements

Monday April 23 12

1 low operational amp code complexity

Monday April 23 12

2 easy expanding of capacity

Monday April 23 12

3 low performance impact on application

Monday April 23 12

schema-based logical sharding

Monday April 23 12

many many many (thousands) of logical

shards

Monday April 23 12

that map to fewer physical ones

Monday April 23 12

8 logical shards on 2 machines

user_id 8 = logical shard

logical shards -gt physical shard map

0 A 1 A 2 A 3 A 4 B 5 B 6 B 7 B

Monday April 23 12

8 logical shards on 2 4 machines

user_id 8 = logical shard

logical shards -gt physical shard map

0 A 1 A 2 C 3 C 4 B 5 B 6 D 7 D

Monday April 23 12

iexclschemas

Monday April 23 12

all that lsquopublicrsquo stuff Irsquod been glossing over for 2

years

Monday April 23 12

- database - schema - table - columns

Monday April 23 12

spun up set of machines

Monday April 23 12

using fabric created thousands of schemas

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineB shard4 photos_by_user shard5 photos_by_user shard6 photos_by_user shard7 photos_by_user

Monday April 23 12

(fabric or similar parallel task executor is

essential)

Monday April 23 12

application-side logic

Monday April 23 12

SHARD_TO_DB =

SHARD_TO_DB[0] = 0SHARD_TO_DB[1] = 0SHARD_TO_DB[2] = 0SHARD_TO_DB[3] = 0SHARD_TO_DB[4] = 1SHARD_TO_DB[5] = 1SHARD_TO_DB[6] = 1SHARD_TO_DB[7] = 1

Monday April 23 12

instead of Django ORM wrote really simple db

abstraction layer

Monday April 23 12

selectupdateinsertdelete

Monday April 23 12

select(fields table_name shard_key where_statement where_parameters)

Monday April 23 12

select(fields table_name shard_key where_statement where_parameters)

shard_key num_logical_shards = shard_id

Monday April 23 12

in most cases user_id for us

Monday April 23 12

custom Django test runner to createtear-down sharded DBs

Monday April 23 12

most queries involve visiting handful of shards

over one or two machines

Monday April 23 12

if mapping across shards on single DB UNION

ALL to aggregate

Monday April 23 12

clients to library pass in((shard_key id)

(shard_keyid)) etc

Monday April 23 12

library maps sub-selects to each shard and each

machine

Monday April 23 12

parallel execution (per-machine at least)

Monday April 23 12

-gt Append (cost=00097372 rows=100 width=12) (actual time=0290160035 rows=30 loops=1) -gt Limit (cost=00080624 rows=30 width=12) (actual time=0288159913 rows=14 loops=1) -gt Index Scan Backward using index on table (cost=0001865104 rows=694 width=12) (actual time=0286159885 rows=14 loops=1) -gt Limit (cost=0007115 rows=30 width=12) (actual time=00150018 rows=1 loops=1) -gt Index Scan using index on table (cost=00010199 rows=43 width=12) (actual time=00130014 rows=1 loops=1) (etc)

Monday April 23 12

eventually would be nice to parallelize across

machines

Monday April 23 12

next challenge unique IDs

Monday April 23 12

requirements

Monday April 23 12

1 should be time sortable without requiring

a lookup

Monday April 23 12

2 should be 64-bit

Monday April 23 12

3 low operational complexity

Monday April 23 12

surveyed the options

Monday April 23 12

ticket servers

Monday April 23 12

UUID

Monday April 23 12

twitter snowflake

Monday April 23 12

application-level IDs ala Mongo

Monday April 23 12

hey the db is already pretty good about

incrementing sequences

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

CREATE OR REPLACE FUNCTION insta5next_id(OUT result bigint) AS $$DECLARE our_epoch bigint = 1314220021721 seq_id bigint now_millis bigint shard_id int = 5BEGIN SELECT nextval(insta5table_id_seq) 1024 INTO seq_id

SELECT FLOOR(EXTRACT(EPOCH FROM clock_timestamp()) 1000) INTO now_millis result = (now_millis - our_epoch) ltlt 23 result = result | (shard_id ltlt 10) result = result | (seq_id)END$$ LANGUAGE PLPGSQL

Monday April 23 12

pulling shard ID from ID

shard_id = id ^ ((id gtgt 23) ltlt 23)timestamp = EPOCH + id gtgt 23

Monday April 23 12

pros guaranteed unique in 64-bits not much of a

CPU overhead

Monday April 23 12

cons large IDs from the get-go

Monday April 23 12

hundreds of millions of IDs generated with this

scheme no issues

Monday April 23 12

well what about ldquore-shardingrdquo

Monday April 23 12

first recourse pg_reorg

Monday April 23 12

rewrites tables in index order

Monday April 23 12

only requires brief locks for atomic table renames

Monday April 23 12

20+GB savings on some of our dbs

Monday April 23 12

Monday April 23 12

especially useful on EC2

Monday April 23 12

but sometimes you just have to reshard

Monday April 23 12

streaming replication to the rescue

Monday April 23 12

(btw repmgr is awesome)

Monday April 23 12

repmgr standby clone ltmastergt

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineArsquo shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineC shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

Monday April 23 12

PGBouncer abstracts moving DBs from the

app logic

Monday April 23 12

can do this as long as you have more logical shards than physical

ones

Monday April 23 12

beauty of schemas is that they are physically

different files

Monday April 23 12

(no IO hit when deleting no lsquoswiss cheesersquo)

Monday April 23 12

downside requires ~30 seconds of maintenance to roll out new schema

mapping

Monday April 23 12

(could be solved by having concept of ldquoread-

onlyrdquo mode for some DBs)

Monday April 23 12

not great for range-scans that would span across

shards

Monday April 23 12

latest project follow graph

Monday April 23 12

v1 simple DB table(source_id target_id

status)

Monday April 23 12

who do I followwho follows me

do I follow Xdoes X follow me

Monday April 23 12

DB was busy so we started storing parallel

version in Redis

Monday April 23 12

follow_all(300 item list)

Monday April 23 12

inconsistency

Monday April 23 12

extra logic

Monday April 23 12

so much extra logic

Monday April 23 12

exposing your support team to the idea of cache invalidation

Monday April 23 12

Monday April 23 12

redesign took a page from twitterrsquos book

Monday April 23 12

PG can handle tens of thousands of requests very light memcached

caching

Monday April 23 12

next steps

Monday April 23 12

isolating services to minimize open conns

Monday April 23 12

investigate physical hardware etc to reduce

need to re-shard

Monday April 23 12

Wrap up

Monday April 23 12

you donrsquot need to give up PGrsquos durability amp features to shard

Monday April 23 12

continue to let the DB do what the DB is great at

Monday April 23 12

ldquodonrsquot shard until you have tordquo

Monday April 23 12

(but donrsquot over-estimate how hard it will be either)

Monday April 23 12

scaled within constraints of the cloud

Monday April 23 12

PG success story

Monday April 23 12

(wersquore really excited about 92)

Monday April 23 12

thanks any qs

Monday April 23 12

CrimeDesk SF

Monday April 23 12

Monday April 23 12

early mix PG Redis Memcached

Monday April 23 12

but were hosted on a single machine

somewhere in LA

Monday April 23 12

Monday April 23 12

less powerful than my MacBook Pro

Monday April 23 12

okay we launchednow what

Monday April 23 12

25k signups in the first day

Monday April 23 12

everything is on fire

Monday April 23 12

best amp worst day of our lives so far

Monday April 23 12

load was through the roof

Monday April 23 12

friday rolls around

Monday April 23 12

not slowing down

Monday April 23 12

letrsquos move to EC2

Monday April 23 12

Monday April 23 12

PG upgrade to 90

Monday April 23 12

Monday April 23 12

scaling = replacing all components of a car

while driving it at 100mph

Monday April 23 12

this is the story of how our usage of PG has

evolved

Monday April 23 12

Phase 1 All ORM all the time

Monday April 23 12

why pg at first postgis

Monday April 23 12

managepy syncdb

Monday April 23 12

ORM made it too easy to not really think through

primary keys

Monday April 23 12

pretty good for getting off the ground

Monday April 23 12

Mediaobjectsget(pk = 4)

Monday April 23 12

first version of our feed (pre-launch)

Monday April 23 12

friends = Relationshipsobjectsfilter(source_user = user)

recent_photos = Mediaobjectsfilter(user_id__in = friends)order_by(lsquo-pkrsquo)[020]

Monday April 23 12

main feed at launch

Monday April 23 12

Redis user 33 postsfriends = SMEMBERS followers33for user in friends LPUSH feedltuser_idgt ltmedia_idgt

for readingLRANGE feed4 0 20

Monday April 23 12

canonical data PGfeedslistssets Redis

object cache memcache

Monday April 23 12

post-launch

Monday April 23 12

moved db to its own machine

Monday April 23 12

at time largest table photo metadata

Monday April 23 12

ran master-slave from the beginning with

streaming replication

Monday April 23 12

backups stop the replica xfs_freeze drives and take EBS snapshot

Monday April 23 12

AWS tradeoff

Monday April 23 12

3 early problems we hit with PG

Monday April 23 12

1 oh that setting was the problem

Monday April 23 12

work_mem

Monday April 23 12

shared_buffers

Monday April 23 12

cost_delay

Monday April 23 12

2 Django-specific ltidle in transactiongt

Monday April 23 12

3 connection pooling

Monday April 23 12

(we use PGBouncer)

Monday April 23 12

somewhere in this crazy couple of months

Christophe to the rescue

Monday April 23 12

photos kept growing and growing

Monday April 23 12

and only 68GB of RAM on biggest machine in EC2

Monday April 23 12

so what now

Monday April 23 12

Phase 2 Vertical Partitioning

Monday April 23 12

django db routers make it pretty easy

Monday April 23 12

def db_for_read(self model) if app_label == photos return photodb

Monday April 23 12

once you untangle all your foreign key

relationships

Monday April 23 12

(all of those useruser_id interchangeable calls bite

you now)

Monday April 23 12

plenty of time spent in PGFouine

Monday April 23 12

read slaves (using streaming replicas) where

we need to reduce contention

Monday April 23 12

a few months later

Monday April 23 12

photosdb gt 60GB

Monday April 23 12

precipitated by being on cloud hardware but likely to have hit limit eventually

either way

Monday April 23 12

what now

Monday April 23 12

horizontal partitioning

Monday April 23 12

Phase 3 sharding

Monday April 23 12

ldquosurely wersquoll have hired someone experienced before we actually need

to shardrdquo

Monday April 23 12

never true about scaling

Monday April 23 12

1 choosing a method2 adapting the application

3 expanding capacity

Monday April 23 12

evaluated solutions

Monday April 23 12

at the time none were up to task of being our

primary DB

Monday April 23 12

NoSQL alternatives

Monday April 23 12

Skypersquos sharding proxy

Monday April 23 12

Rangedate-based partitioning

Monday April 23 12

did in Postgres itself

Monday April 23 12

requirements

Monday April 23 12

1 low operational amp code complexity

Monday April 23 12

2 easy expanding of capacity

Monday April 23 12

3 low performance impact on application

Monday April 23 12

schema-based logical sharding

Monday April 23 12

many many many (thousands) of logical

shards

Monday April 23 12

that map to fewer physical ones

Monday April 23 12

8 logical shards on 2 machines

user_id 8 = logical shard

logical shards -gt physical shard map

0 A 1 A 2 A 3 A 4 B 5 B 6 B 7 B

Monday April 23 12

8 logical shards on 2 4 machines

user_id 8 = logical shard

logical shards -gt physical shard map

0 A 1 A 2 C 3 C 4 B 5 B 6 D 7 D

Monday April 23 12

iexclschemas

Monday April 23 12

all that lsquopublicrsquo stuff Irsquod been glossing over for 2

years

Monday April 23 12

- database - schema - table - columns

Monday April 23 12

spun up set of machines

Monday April 23 12

using fabric created thousands of schemas

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineB shard4 photos_by_user shard5 photos_by_user shard6 photos_by_user shard7 photos_by_user

Monday April 23 12

(fabric or similar parallel task executor is

essential)

Monday April 23 12

application-side logic

Monday April 23 12

SHARD_TO_DB =

SHARD_TO_DB[0] = 0SHARD_TO_DB[1] = 0SHARD_TO_DB[2] = 0SHARD_TO_DB[3] = 0SHARD_TO_DB[4] = 1SHARD_TO_DB[5] = 1SHARD_TO_DB[6] = 1SHARD_TO_DB[7] = 1

Monday April 23 12

instead of Django ORM wrote really simple db

abstraction layer

Monday April 23 12

selectupdateinsertdelete

Monday April 23 12

select(fields table_name shard_key where_statement where_parameters)

Monday April 23 12

select(fields table_name shard_key where_statement where_parameters)

shard_key num_logical_shards = shard_id

Monday April 23 12

in most cases user_id for us

Monday April 23 12

custom Django test runner to createtear-down sharded DBs

Monday April 23 12

most queries involve visiting handful of shards

over one or two machines

Monday April 23 12

if mapping across shards on single DB UNION

ALL to aggregate

Monday April 23 12

clients to library pass in((shard_key id)

(shard_keyid)) etc

Monday April 23 12

library maps sub-selects to each shard and each

machine

Monday April 23 12

parallel execution (per-machine at least)

Monday April 23 12

-gt Append (cost=00097372 rows=100 width=12) (actual time=0290160035 rows=30 loops=1) -gt Limit (cost=00080624 rows=30 width=12) (actual time=0288159913 rows=14 loops=1) -gt Index Scan Backward using index on table (cost=0001865104 rows=694 width=12) (actual time=0286159885 rows=14 loops=1) -gt Limit (cost=0007115 rows=30 width=12) (actual time=00150018 rows=1 loops=1) -gt Index Scan using index on table (cost=00010199 rows=43 width=12) (actual time=00130014 rows=1 loops=1) (etc)

Monday April 23 12

eventually would be nice to parallelize across

machines

Monday April 23 12

next challenge unique IDs

Monday April 23 12

requirements

Monday April 23 12

1 should be time sortable without requiring

a lookup

Monday April 23 12

2 should be 64-bit

Monday April 23 12

3 low operational complexity

Monday April 23 12

surveyed the options

Monday April 23 12

ticket servers

Monday April 23 12

UUID

Monday April 23 12

twitter snowflake

Monday April 23 12

application-level IDs ala Mongo

Monday April 23 12

hey the db is already pretty good about

incrementing sequences

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

CREATE OR REPLACE FUNCTION insta5next_id(OUT result bigint) AS $$DECLARE our_epoch bigint = 1314220021721 seq_id bigint now_millis bigint shard_id int = 5BEGIN SELECT nextval(insta5table_id_seq) 1024 INTO seq_id

SELECT FLOOR(EXTRACT(EPOCH FROM clock_timestamp()) 1000) INTO now_millis result = (now_millis - our_epoch) ltlt 23 result = result | (shard_id ltlt 10) result = result | (seq_id)END$$ LANGUAGE PLPGSQL

Monday April 23 12

pulling shard ID from ID

shard_id = id ^ ((id gtgt 23) ltlt 23)timestamp = EPOCH + id gtgt 23

Monday April 23 12

pros guaranteed unique in 64-bits not much of a

CPU overhead

Monday April 23 12

cons large IDs from the get-go

Monday April 23 12

hundreds of millions of IDs generated with this

scheme no issues

Monday April 23 12

well what about ldquore-shardingrdquo

Monday April 23 12

first recourse pg_reorg

Monday April 23 12

rewrites tables in index order

Monday April 23 12

only requires brief locks for atomic table renames

Monday April 23 12

20+GB savings on some of our dbs

Monday April 23 12

Monday April 23 12

especially useful on EC2

Monday April 23 12

but sometimes you just have to reshard

Monday April 23 12

streaming replication to the rescue

Monday April 23 12

(btw repmgr is awesome)

Monday April 23 12

repmgr standby clone ltmastergt

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineArsquo shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineC shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

Monday April 23 12

PGBouncer abstracts moving DBs from the

app logic

Monday April 23 12

can do this as long as you have more logical shards than physical

ones

Monday April 23 12

beauty of schemas is that they are physically

different files

Monday April 23 12

(no IO hit when deleting no lsquoswiss cheesersquo)

Monday April 23 12

downside requires ~30 seconds of maintenance to roll out new schema

mapping

Monday April 23 12

(could be solved by having concept of ldquoread-

onlyrdquo mode for some DBs)

Monday April 23 12

not great for range-scans that would span across

shards

Monday April 23 12

latest project follow graph

Monday April 23 12

v1: simple DB table (source_id, target_id, status)

Monday April 23 12

who do I follow? who follows me?

do I follow X? does X follow me?

Monday April 23 12
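A hedged reading of that v1 design, with the table name, types, and index invented for illustration; the four questions above map to four simple lookups:

CREATE TABLE followers (
    source_id bigint   NOT NULL,  -- who is doing the following
    target_id bigint   NOT NULL,  -- who is being followed
    status    smallint NOT NULL,  -- e.g. active vs. blocked (assumed)
    PRIMARY KEY (source_id, target_id)
);
CREATE INDEX followers_target_idx ON followers (target_id);

SELECT target_id FROM followers WHERE source_id = 33;             -- who do I follow?
SELECT source_id FROM followers WHERE target_id = 33;             -- who follows me?
SELECT 1 FROM followers WHERE source_id = 33 AND target_id = 99;  -- do I follow X?
SELECT 1 FROM followers WHERE source_id = 99 AND target_id = 33;  -- does X follow me?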

DB was busy so we started storing parallel

version in Redis

Monday April 23 12

follow_all(300 item list)

Monday April 23 12

inconsistency

Monday April 23 12

extra logic

Monday April 23 12

so much extra logic

Monday April 23 12

exposing your support team to the idea of cache invalidation

Monday April 23 12

Monday April 23 12

redesign took a page from twitter's book

Monday April 23 12

PG can handle tens of thousands of requests, with very light memcached caching

Monday April 23 12

next steps

Monday April 23 12

isolating services to minimize open conns

Monday April 23 12

investigate physical hardware etc to reduce

need to re-shard

Monday April 23 12

Wrap up

Monday April 23 12

you don't need to give up PG's durability & features to shard

Monday April 23 12

continue to let the DB do what the DB is great at

Monday April 23 12

"don't shard until you have to"

Monday April 23 12

(but don't over-estimate how hard it will be either)

Monday April 23 12

scaled within constraints of the cloud

Monday April 23 12

PG success story

Monday April 23 12

(we're really excited about 9.2)

Monday April 23 12

thanks! any qs?

Monday April 23 12

Monday April 23 12

early mix PG Redis Memcached

Monday April 23 12

but were hosted on a single machine

somewhere in LA

Monday April 23 12

Monday April 23 12

less powerful than my MacBook Pro

Monday April 23 12

okay we launchednow what

Monday April 23 12

25k signups in the first day

Monday April 23 12

everything is on fire

Monday April 23 12

best amp worst day of our lives so far

Monday April 23 12

load was through the roof

Monday April 23 12

friday rolls around

Monday April 23 12

not slowing down

Monday April 23 12

letrsquos move to EC2

Monday April 23 12

Monday April 23 12

PG upgrade to 90

Monday April 23 12

Monday April 23 12

scaling = replacing all components of a car

while driving it at 100mph

Monday April 23 12

this is the story of how our usage of PG has

evolved

Monday April 23 12

Phase 1 All ORM all the time

Monday April 23 12

why pg at first postgis

Monday April 23 12

managepy syncdb

Monday April 23 12

ORM made it too easy to not really think through

primary keys

Monday April 23 12

pretty good for getting off the ground

Monday April 23 12

Mediaobjectsget(pk = 4)

Monday April 23 12

first version of our feed (pre-launch)

Monday April 23 12

friends = Relationshipsobjectsfilter(source_user = user)

recent_photos = Mediaobjectsfilter(user_id__in = friends)order_by(lsquo-pkrsquo)[020]

Monday April 23 12

main feed at launch

Monday April 23 12

Redis user 33 postsfriends = SMEMBERS followers33for user in friends LPUSH feedltuser_idgt ltmedia_idgt

for readingLRANGE feed4 0 20

Monday April 23 12

canonical data PGfeedslistssets Redis

object cache memcache

Monday April 23 12

post-launch

Monday April 23 12

moved db to its own machine

Monday April 23 12

at time largest table photo metadata

Monday April 23 12

ran master-slave from the beginning with

streaming replication

Monday April 23 12

backups stop the replica xfs_freeze drives and take EBS snapshot

Monday April 23 12

AWS tradeoff

Monday April 23 12

3 early problems we hit with PG

Monday April 23 12

1 oh that setting was the problem

Monday April 23 12

work_mem

Monday April 23 12

shared_buffers

Monday April 23 12

cost_delay

Monday April 23 12

2 Django-specific ltidle in transactiongt

Monday April 23 12

3 connection pooling

Monday April 23 12

(we use PGBouncer)

Monday April 23 12

somewhere in this crazy couple of months

Christophe to the rescue

Monday April 23 12

photos kept growing and growing

Monday April 23 12

and only 68GB of RAM on biggest machine in EC2

Monday April 23 12

so what now

Monday April 23 12

Phase 2 Vertical Partitioning

Monday April 23 12

django db routers make it pretty easy

Monday April 23 12

def db_for_read(self model) if app_label == photos return photodb

Monday April 23 12

once you untangle all your foreign key

relationships

Monday April 23 12

(all of those useruser_id interchangeable calls bite

you now)

Monday April 23 12

plenty of time spent in PGFouine

Monday April 23 12

read slaves (using streaming replicas) where

we need to reduce contention

Monday April 23 12

a few months later

Monday April 23 12

photosdb gt 60GB

Monday April 23 12

precipitated by being on cloud hardware but likely to have hit limit eventually

either way

Monday April 23 12

what now

Monday April 23 12

horizontal partitioning

Monday April 23 12

Phase 3 sharding

Monday April 23 12

ldquosurely wersquoll have hired someone experienced before we actually need

to shardrdquo

Monday April 23 12

never true about scaling

Monday April 23 12

1 choosing a method2 adapting the application

3 expanding capacity

Monday April 23 12

evaluated solutions

Monday April 23 12

at the time none were up to task of being our

primary DB

Monday April 23 12

NoSQL alternatives

Monday April 23 12

Skypersquos sharding proxy

Monday April 23 12

Rangedate-based partitioning

Monday April 23 12

did in Postgres itself

Monday April 23 12

requirements

Monday April 23 12

1 low operational amp code complexity

Monday April 23 12

2 easy expanding of capacity

Monday April 23 12

3 low performance impact on application

Monday April 23 12

schema-based logical sharding

Monday April 23 12

many many many (thousands) of logical

shards

Monday April 23 12

that map to fewer physical ones

Monday April 23 12

8 logical shards on 2 machines

user_id 8 = logical shard

logical shards -gt physical shard map

0 A 1 A 2 A 3 A 4 B 5 B 6 B 7 B

Monday April 23 12

8 logical shards on 2 4 machines

user_id 8 = logical shard

logical shards -gt physical shard map

0 A 1 A 2 C 3 C 4 B 5 B 6 D 7 D

Monday April 23 12

iexclschemas

Monday April 23 12

all that lsquopublicrsquo stuff Irsquod been glossing over for 2

years

Monday April 23 12

- database - schema - table - columns

Monday April 23 12

spun up set of machines

Monday April 23 12

using fabric created thousands of schemas

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineB shard4 photos_by_user shard5 photos_by_user shard6 photos_by_user shard7 photos_by_user

Monday April 23 12

(fabric or similar parallel task executor is

essential)

Monday April 23 12

application-side logic

Monday April 23 12

SHARD_TO_DB =

SHARD_TO_DB[0] = 0SHARD_TO_DB[1] = 0SHARD_TO_DB[2] = 0SHARD_TO_DB[3] = 0SHARD_TO_DB[4] = 1SHARD_TO_DB[5] = 1SHARD_TO_DB[6] = 1SHARD_TO_DB[7] = 1

Monday April 23 12

instead of Django ORM wrote really simple db

abstraction layer

Monday April 23 12

selectupdateinsertdelete

Monday April 23 12

select(fields table_name shard_key where_statement where_parameters)

Monday April 23 12

select(fields table_name shard_key where_statement where_parameters)

shard_key num_logical_shards = shard_id

Monday April 23 12

in most cases user_id for us

Monday April 23 12

custom Django test runner to createtear-down sharded DBs

Monday April 23 12

most queries involve visiting handful of shards

over one or two machines

Monday April 23 12

if mapping across shards on single DB UNION

ALL to aggregate

Monday April 23 12

clients to library pass in((shard_key id)

(shard_keyid)) etc

Monday April 23 12

library maps sub-selects to each shard and each

machine

Monday April 23 12

parallel execution (per-machine at least)

Monday April 23 12

-gt Append (cost=00097372 rows=100 width=12) (actual time=0290160035 rows=30 loops=1) -gt Limit (cost=00080624 rows=30 width=12) (actual time=0288159913 rows=14 loops=1) -gt Index Scan Backward using index on table (cost=0001865104 rows=694 width=12) (actual time=0286159885 rows=14 loops=1) -gt Limit (cost=0007115 rows=30 width=12) (actual time=00150018 rows=1 loops=1) -gt Index Scan using index on table (cost=00010199 rows=43 width=12) (actual time=00130014 rows=1 loops=1) (etc)

Monday April 23 12

eventually would be nice to parallelize across

machines

Monday April 23 12

next challenge unique IDs

Monday April 23 12

requirements

Monday April 23 12

1 should be time sortable without requiring

a lookup

Monday April 23 12

2 should be 64-bit

Monday April 23 12

3 low operational complexity

Monday April 23 12

surveyed the options

Monday April 23 12

ticket servers

Monday April 23 12

UUID

Monday April 23 12

twitter snowflake

Monday April 23 12

application-level IDs ala Mongo

Monday April 23 12

hey the db is already pretty good about

incrementing sequences

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

CREATE OR REPLACE FUNCTION insta5next_id(OUT result bigint) AS $$DECLARE our_epoch bigint = 1314220021721 seq_id bigint now_millis bigint shard_id int = 5BEGIN SELECT nextval(insta5table_id_seq) 1024 INTO seq_id

SELECT FLOOR(EXTRACT(EPOCH FROM clock_timestamp()) 1000) INTO now_millis result = (now_millis - our_epoch) ltlt 23 result = result | (shard_id ltlt 10) result = result | (seq_id)END$$ LANGUAGE PLPGSQL

Monday April 23 12

pulling shard ID from ID

shard_id = id ^ ((id gtgt 23) ltlt 23)timestamp = EPOCH + id gtgt 23

Monday April 23 12

pros guaranteed unique in 64-bits not much of a

CPU overhead

Monday April 23 12

cons large IDs from the get-go

Monday April 23 12

hundreds of millions of IDs generated with this

scheme no issues

Monday April 23 12

well what about ldquore-shardingrdquo

Monday April 23 12

first recourse pg_reorg

Monday April 23 12

rewrites tables in index order

Monday April 23 12

only requires brief locks for atomic table renames

Monday April 23 12

20+GB savings on some of our dbs

Monday April 23 12

Monday April 23 12

especially useful on EC2

Monday April 23 12

but sometimes you just have to reshard

Monday April 23 12

streaming replication to the rescue

Monday April 23 12

(btw repmgr is awesome)

Monday April 23 12

repmgr standby clone ltmastergt

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineArsquo shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineC shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

Monday April 23 12

PGBouncer abstracts moving DBs from the

app logic

Monday April 23 12

can do this as long as you have more logical shards than physical

ones

Monday April 23 12

beauty of schemas is that they are physically

different files

Monday April 23 12

(no IO hit when deleting no lsquoswiss cheesersquo)

Monday April 23 12

downside requires ~30 seconds of maintenance to roll out new schema

mapping

Monday April 23 12

(could be solved by having concept of ldquoread-

onlyrdquo mode for some DBs)

Monday April 23 12

not great for range-scans that would span across

shards

Monday April 23 12

latest project follow graph

Monday April 23 12

v1 simple DB table(source_id target_id

status)

Monday April 23 12

who do I followwho follows me

do I follow Xdoes X follow me

Monday April 23 12

DB was busy so we started storing parallel

version in Redis

Monday April 23 12

follow_all(300 item list)

Monday April 23 12

inconsistency

Monday April 23 12

extra logic

Monday April 23 12

so much extra logic

Monday April 23 12

exposing your support team to the idea of cache invalidation

Monday April 23 12

Monday April 23 12

redesign took a page from twitterrsquos book

Monday April 23 12

PG can handle tens of thousands of requests very light memcached

caching

Monday April 23 12

next steps

Monday April 23 12

isolating services to minimize open conns

Monday April 23 12

investigate physical hardware etc to reduce

need to re-shard

Monday April 23 12

Wrap up

Monday April 23 12

you donrsquot need to give up PGrsquos durability amp features to shard

Monday April 23 12

continue to let the DB do what the DB is great at

Monday April 23 12

ldquodonrsquot shard until you have tordquo

Monday April 23 12

(but donrsquot over-estimate how hard it will be either)

Monday April 23 12

scaled within constraints of the cloud

Monday April 23 12

PG success story

Monday April 23 12

(wersquore really excited about 92)

Monday April 23 12

thanks any qs

Monday April 23 12

early mix PG Redis Memcached

Monday April 23 12

but were hosted on a single machine

somewhere in LA

Monday April 23 12

Monday April 23 12

less powerful than my MacBook Pro

Monday April 23 12

okay we launchednow what

Monday April 23 12

25k signups in the first day

Monday April 23 12

everything is on fire

Monday April 23 12

best amp worst day of our lives so far

Monday April 23 12

load was through the roof

Monday April 23 12

friday rolls around

Monday April 23 12

not slowing down

Monday April 23 12

letrsquos move to EC2

Monday April 23 12

Monday April 23 12

PG upgrade to 90

Monday April 23 12

Monday April 23 12

scaling = replacing all components of a car

while driving it at 100mph

Monday April 23 12

this is the story of how our usage of PG has

evolved

Monday April 23 12

Phase 1 All ORM all the time

Monday April 23 12

why pg at first postgis

Monday April 23 12

managepy syncdb

Monday April 23 12

ORM made it too easy to not really think through

primary keys

Monday April 23 12

pretty good for getting off the ground

Monday April 23 12

Mediaobjectsget(pk = 4)

Monday April 23 12

first version of our feed (pre-launch)

Monday April 23 12

friends = Relationshipsobjectsfilter(source_user = user)

recent_photos = Mediaobjectsfilter(user_id__in = friends)order_by(lsquo-pkrsquo)[020]

Monday April 23 12

main feed at launch

Monday April 23 12

Redis user 33 postsfriends = SMEMBERS followers33for user in friends LPUSH feedltuser_idgt ltmedia_idgt

for readingLRANGE feed4 0 20

Monday April 23 12

canonical data PGfeedslistssets Redis

object cache memcache

Monday April 23 12

post-launch

Monday April 23 12

moved db to its own machine

Monday April 23 12

at time largest table photo metadata

Monday April 23 12

ran master-slave from the beginning with

streaming replication

Monday April 23 12

backups stop the replica xfs_freeze drives and take EBS snapshot

Monday April 23 12

AWS tradeoff

Monday April 23 12

3 early problems we hit with PG

Monday April 23 12

1 oh that setting was the problem

Monday April 23 12

work_mem

Monday April 23 12

shared_buffers

Monday April 23 12

cost_delay

Monday April 23 12

2 Django-specific ltidle in transactiongt

Monday April 23 12

3 connection pooling

Monday April 23 12

(we use PGBouncer)

Monday April 23 12

somewhere in this crazy couple of months

Christophe to the rescue

Monday April 23 12

photos kept growing and growing

Monday April 23 12

and only 68GB of RAM on biggest machine in EC2

Monday April 23 12

so what now

Monday April 23 12

Phase 2 Vertical Partitioning

Monday April 23 12

django db routers make it pretty easy

Monday April 23 12

def db_for_read(self model) if app_label == photos return photodb

Monday April 23 12

once you untangle all your foreign key

relationships

Monday April 23 12

(all of those useruser_id interchangeable calls bite

you now)

Monday April 23 12

plenty of time spent in PGFouine

Monday April 23 12

read slaves (using streaming replicas) where

we need to reduce contention

Monday April 23 12

a few months later

Monday April 23 12

photosdb gt 60GB

Monday April 23 12

precipitated by being on cloud hardware but likely to have hit limit eventually

either way

Monday April 23 12

what now

Monday April 23 12

horizontal partitioning

Monday April 23 12

Phase 3 sharding

Monday April 23 12

ldquosurely wersquoll have hired someone experienced before we actually need

to shardrdquo

Monday April 23 12

never true about scaling

Monday April 23 12

1 choosing a method2 adapting the application

3 expanding capacity

Monday April 23 12

evaluated solutions

Monday April 23 12

at the time none were up to task of being our

primary DB

Monday April 23 12

NoSQL alternatives

Monday April 23 12

Skypersquos sharding proxy

Monday April 23 12

Rangedate-based partitioning

Monday April 23 12

did in Postgres itself

Monday April 23 12

requirements

Monday April 23 12

1 low operational amp code complexity

Monday April 23 12

2 easy expanding of capacity

Monday April 23 12

3 low performance impact on application

Monday April 23 12

schema-based logical sharding

Monday April 23 12

many many many (thousands) of logical

shards

Monday April 23 12

that map to fewer physical ones

Monday April 23 12

8 logical shards on 2 machines

user_id 8 = logical shard

logical shards -gt physical shard map

0 A 1 A 2 A 3 A 4 B 5 B 6 B 7 B

Monday April 23 12

8 logical shards on 2 4 machines

user_id 8 = logical shard

logical shards -gt physical shard map

0 A 1 A 2 C 3 C 4 B 5 B 6 D 7 D

Monday April 23 12

iexclschemas

Monday April 23 12

all that lsquopublicrsquo stuff Irsquod been glossing over for 2

years

Monday April 23 12

- database - schema - table - columns

Monday April 23 12

spun up set of machines

Monday April 23 12

using fabric created thousands of schemas

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineB shard4 photos_by_user shard5 photos_by_user shard6 photos_by_user shard7 photos_by_user

Monday April 23 12

(fabric or similar parallel task executor is

essential)

Monday April 23 12

application-side logic

Monday April 23 12

SHARD_TO_DB =

SHARD_TO_DB[0] = 0SHARD_TO_DB[1] = 0SHARD_TO_DB[2] = 0SHARD_TO_DB[3] = 0SHARD_TO_DB[4] = 1SHARD_TO_DB[5] = 1SHARD_TO_DB[6] = 1SHARD_TO_DB[7] = 1

Monday April 23 12

instead of Django ORM wrote really simple db

abstraction layer

Monday April 23 12

selectupdateinsertdelete

Monday April 23 12

select(fields table_name shard_key where_statement where_parameters)

Monday April 23 12

select(fields table_name shard_key where_statement where_parameters)

shard_key num_logical_shards = shard_id

Monday April 23 12

in most cases user_id for us

Monday April 23 12

custom Django test runner to createtear-down sharded DBs

Monday April 23 12

most queries involve visiting handful of shards

over one or two machines

Monday April 23 12

if mapping across shards on single DB UNION

ALL to aggregate

Monday April 23 12

clients to library pass in((shard_key id)

(shard_keyid)) etc

Monday April 23 12

library maps sub-selects to each shard and each

machine

Monday April 23 12

parallel execution (per-machine at least)

Monday April 23 12

-gt Append (cost=00097372 rows=100 width=12) (actual time=0290160035 rows=30 loops=1) -gt Limit (cost=00080624 rows=30 width=12) (actual time=0288159913 rows=14 loops=1) -gt Index Scan Backward using index on table (cost=0001865104 rows=694 width=12) (actual time=0286159885 rows=14 loops=1) -gt Limit (cost=0007115 rows=30 width=12) (actual time=00150018 rows=1 loops=1) -gt Index Scan using index on table (cost=00010199 rows=43 width=12) (actual time=00130014 rows=1 loops=1) (etc)

Monday April 23 12

eventually would be nice to parallelize across

machines

Monday April 23 12

next challenge unique IDs

Monday April 23 12

requirements

Monday April 23 12

1 should be time sortable without requiring

a lookup

Monday April 23 12

2 should be 64-bit

Monday April 23 12

3 low operational complexity

Monday April 23 12

surveyed the options

Monday April 23 12

ticket servers

Monday April 23 12

UUID

Monday April 23 12

twitter snowflake

Monday April 23 12

application-level IDs ala Mongo

Monday April 23 12

hey the db is already pretty good about

incrementing sequences

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

CREATE OR REPLACE FUNCTION insta5next_id(OUT result bigint) AS $$DECLARE our_epoch bigint = 1314220021721 seq_id bigint now_millis bigint shard_id int = 5BEGIN SELECT nextval(insta5table_id_seq) 1024 INTO seq_id

SELECT FLOOR(EXTRACT(EPOCH FROM clock_timestamp()) 1000) INTO now_millis result = (now_millis - our_epoch) ltlt 23 result = result | (shard_id ltlt 10) result = result | (seq_id)END$$ LANGUAGE PLPGSQL

Monday April 23 12

pulling shard ID from ID

shard_id = id ^ ((id gtgt 23) ltlt 23)timestamp = EPOCH + id gtgt 23

Monday April 23 12

pros guaranteed unique in 64-bits not much of a

CPU overhead

Monday April 23 12

cons large IDs from the get-go

Monday April 23 12

hundreds of millions of IDs generated with this

scheme no issues

Monday April 23 12

well what about ldquore-shardingrdquo

Monday April 23 12

first recourse pg_reorg

Monday April 23 12

rewrites tables in index order

Monday April 23 12

only requires brief locks for atomic table renames

Monday April 23 12

20+GB savings on some of our dbs

Monday April 23 12

Monday April 23 12

especially useful on EC2

Monday April 23 12

but sometimes you just have to reshard

Monday April 23 12

streaming replication to the rescue

Monday April 23 12

(btw repmgr is awesome)

Monday April 23 12

repmgr standby clone ltmastergt

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineArsquo shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineC shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

Monday April 23 12

PGBouncer abstracts moving DBs from the

app logic

Monday April 23 12

can do this as long as you have more logical shards than physical

ones

Monday April 23 12

beauty of schemas is that they are physically

different files

Monday April 23 12

(no IO hit when deleting no lsquoswiss cheesersquo)

Monday April 23 12

downside requires ~30 seconds of maintenance to roll out new schema

mapping

Monday April 23 12

(could be solved by having concept of ldquoread-

onlyrdquo mode for some DBs)

Monday April 23 12

not great for range-scans that would span across

shards

Monday April 23 12

latest project follow graph

Monday April 23 12

v1 simple DB table(source_id target_id

status)

Monday April 23 12

who do I followwho follows me

do I follow Xdoes X follow me

Monday April 23 12

DB was busy so we started storing parallel

version in Redis

Monday April 23 12

follow_all(300 item list)

Monday April 23 12

inconsistency

Monday April 23 12

extra logic

Monday April 23 12

so much extra logic

Monday April 23 12

exposing your support team to the idea of cache invalidation

Monday April 23 12

Monday April 23 12

redesign took a page from twitterrsquos book

Monday April 23 12

PG can handle tens of thousands of requests very light memcached

caching

Monday April 23 12

next steps

Monday April 23 12

isolating services to minimize open conns

Monday April 23 12

investigate physical hardware etc to reduce

need to re-shard

Monday April 23 12

Wrap up

Monday April 23 12

you donrsquot need to give up PGrsquos durability amp features to shard

Monday April 23 12

continue to let the DB do what the DB is great at

Monday April 23 12

ldquodonrsquot shard until you have tordquo

Monday April 23 12

(but donrsquot over-estimate how hard it will be either)

Monday April 23 12

scaled within constraints of the cloud

Monday April 23 12

PG success story

Monday April 23 12

(wersquore really excited about 92)

Monday April 23 12

thanks any qs

Monday April 23 12

but were hosted on a single machine

somewhere in LA

Monday April 23 12

Monday April 23 12

less powerful than my MacBook Pro

Monday April 23 12

okay we launchednow what

Monday April 23 12

25k signups in the first day

Monday April 23 12

everything is on fire

Monday April 23 12

best amp worst day of our lives so far

Monday April 23 12

load was through the roof

Monday April 23 12

friday rolls around

Monday April 23 12

not slowing down

Monday April 23 12

letrsquos move to EC2

Monday April 23 12

Monday April 23 12

PG upgrade to 90

Monday April 23 12

Monday April 23 12

scaling = replacing all components of a car

while driving it at 100mph

Monday April 23 12

this is the story of how our usage of PG has

evolved

Monday April 23 12

Phase 1 All ORM all the time

Monday April 23 12

why pg at first postgis

Monday April 23 12

managepy syncdb

Monday April 23 12

ORM made it too easy to not really think through

primary keys

Monday April 23 12

pretty good for getting off the ground

Monday April 23 12

Mediaobjectsget(pk = 4)

Monday April 23 12

first version of our feed (pre-launch)

Monday April 23 12

friends = Relationshipsobjectsfilter(source_user = user)

recent_photos = Mediaobjectsfilter(user_id__in = friends)order_by(lsquo-pkrsquo)[020]

Monday April 23 12

main feed at launch

Monday April 23 12

Redis user 33 postsfriends = SMEMBERS followers33for user in friends LPUSH feedltuser_idgt ltmedia_idgt

for readingLRANGE feed4 0 20

Monday April 23 12

canonical data PGfeedslistssets Redis

object cache memcache

Monday April 23 12

post-launch

Monday April 23 12

moved db to its own machine

Monday April 23 12

at time largest table photo metadata

Monday April 23 12

ran master-slave from the beginning with

streaming replication

Monday April 23 12

backups stop the replica xfs_freeze drives and take EBS snapshot

Monday April 23 12

AWS tradeoff

Monday April 23 12

3 early problems we hit with PG

Monday April 23 12

1 oh that setting was the problem

Monday April 23 12

work_mem

Monday April 23 12

shared_buffers

Monday April 23 12

cost_delay

Monday April 23 12

2 Django-specific ltidle in transactiongt

Monday April 23 12

3 connection pooling

Monday April 23 12

(we use PGBouncer)

Monday April 23 12

somewhere in this crazy couple of months

Christophe to the rescue

Monday April 23 12

photos kept growing and growing

Monday April 23 12

and only 68GB of RAM on biggest machine in EC2

Monday April 23 12

so what now

Monday April 23 12

Phase 2 Vertical Partitioning

Monday April 23 12

django db routers make it pretty easy

Monday April 23 12

def db_for_read(self model) if app_label == photos return photodb

Monday April 23 12

once you untangle all your foreign key

relationships

Monday April 23 12

(all of those useruser_id interchangeable calls bite

you now)

Monday April 23 12

plenty of time spent in PGFouine

Monday April 23 12

read slaves (using streaming replicas) where

we need to reduce contention

Monday April 23 12

a few months later

Monday April 23 12

photosdb gt 60GB

Monday April 23 12

precipitated by being on cloud hardware but likely to have hit limit eventually

either way

Monday April 23 12

what now

Monday April 23 12

horizontal partitioning

Monday April 23 12

Phase 3 sharding

Monday April 23 12

ldquosurely wersquoll have hired someone experienced before we actually need

to shardrdquo

Monday April 23 12

never true about scaling

Monday April 23 12

1 choosing a method2 adapting the application

3 expanding capacity

Monday April 23 12

evaluated solutions

Monday April 23 12

at the time none were up to task of being our

primary DB

Monday April 23 12

NoSQL alternatives

Monday April 23 12

Skypersquos sharding proxy

Monday April 23 12

Rangedate-based partitioning

Monday April 23 12

did in Postgres itself

Monday April 23 12

requirements

Monday April 23 12

1 low operational amp code complexity

Monday April 23 12

2 easy expanding of capacity

Monday April 23 12

3 low performance impact on application

Monday April 23 12

schema-based logical sharding

Monday April 23 12

many many many (thousands) of logical

shards

Monday April 23 12

that map to fewer physical ones

Monday April 23 12

8 logical shards on 2 machines

user_id 8 = logical shard

logical shards -gt physical shard map

0 A 1 A 2 A 3 A 4 B 5 B 6 B 7 B

Monday April 23 12

8 logical shards on 2 4 machines

user_id 8 = logical shard

logical shards -gt physical shard map

0 A 1 A 2 C 3 C 4 B 5 B 6 D 7 D

Monday April 23 12

iexclschemas

Monday April 23 12

all that lsquopublicrsquo stuff Irsquod been glossing over for 2

years

Monday April 23 12

- database - schema - table - columns

Monday April 23 12

spun up set of machines

Monday April 23 12

using fabric created thousands of schemas

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineB shard4 photos_by_user shard5 photos_by_user shard6 photos_by_user shard7 photos_by_user

Monday April 23 12

(fabric or similar parallel task executor is

essential)

Monday April 23 12

application-side logic

Monday April 23 12

SHARD_TO_DB =

SHARD_TO_DB[0] = 0SHARD_TO_DB[1] = 0SHARD_TO_DB[2] = 0SHARD_TO_DB[3] = 0SHARD_TO_DB[4] = 1SHARD_TO_DB[5] = 1SHARD_TO_DB[6] = 1SHARD_TO_DB[7] = 1

Monday April 23 12

instead of Django ORM wrote really simple db

abstraction layer

Monday April 23 12

selectupdateinsertdelete

Monday April 23 12

select(fields table_name shard_key where_statement where_parameters)

Monday April 23 12

select(fields table_name shard_key where_statement where_parameters)

shard_key num_logical_shards = shard_id

Monday April 23 12

in most cases user_id for us

Monday April 23 12

custom Django test runner to createtear-down sharded DBs

Monday April 23 12

most queries involve visiting handful of shards

over one or two machines

Monday April 23 12

if mapping across shards on single DB UNION

ALL to aggregate

Monday April 23 12

clients to library pass in((shard_key id)

(shard_keyid)) etc

Monday April 23 12

library maps sub-selects to each shard and each

machine

Monday April 23 12

parallel execution (per-machine at least)

Monday April 23 12

-gt Append (cost=00097372 rows=100 width=12) (actual time=0290160035 rows=30 loops=1) -gt Limit (cost=00080624 rows=30 width=12) (actual time=0288159913 rows=14 loops=1) -gt Index Scan Backward using index on table (cost=0001865104 rows=694 width=12) (actual time=0286159885 rows=14 loops=1) -gt Limit (cost=0007115 rows=30 width=12) (actual time=00150018 rows=1 loops=1) -gt Index Scan using index on table (cost=00010199 rows=43 width=12) (actual time=00130014 rows=1 loops=1) (etc)

Monday April 23 12

eventually would be nice to parallelize across

machines

Monday April 23 12

next challenge unique IDs

Monday April 23 12

requirements

Monday April 23 12

1 should be time sortable without requiring

a lookup

Monday April 23 12

2 should be 64-bit

Monday April 23 12

3 low operational complexity

Monday April 23 12

surveyed the options

Monday April 23 12

ticket servers

Monday April 23 12

UUID

Monday April 23 12

twitter snowflake

Monday April 23 12

application-level IDs ala Mongo

Monday April 23 12

hey the db is already pretty good about

incrementing sequences

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

CREATE OR REPLACE FUNCTION insta5next_id(OUT result bigint) AS $$DECLARE our_epoch bigint = 1314220021721 seq_id bigint now_millis bigint shard_id int = 5BEGIN SELECT nextval(insta5table_id_seq) 1024 INTO seq_id

SELECT FLOOR(EXTRACT(EPOCH FROM clock_timestamp()) 1000) INTO now_millis result = (now_millis - our_epoch) ltlt 23 result = result | (shard_id ltlt 10) result = result | (seq_id)END$$ LANGUAGE PLPGSQL

Monday April 23 12

pulling shard ID from ID

shard_id = id ^ ((id gtgt 23) ltlt 23)timestamp = EPOCH + id gtgt 23

Monday April 23 12

pros guaranteed unique in 64-bits not much of a

CPU overhead

Monday April 23 12

cons large IDs from the get-go

Monday April 23 12

hundreds of millions of IDs generated with this

scheme no issues

Monday April 23 12

well what about ldquore-shardingrdquo

Monday April 23 12

first recourse pg_reorg

Monday April 23 12

rewrites tables in index order

Monday April 23 12

only requires brief locks for atomic table renames

Monday April 23 12

20+GB savings on some of our dbs

Monday April 23 12

Monday April 23 12

especially useful on EC2

Monday April 23 12

but sometimes you just have to reshard

Monday April 23 12

streaming replication to the rescue

Monday April 23 12

(btw repmgr is awesome)

Monday April 23 12

repmgr standby clone ltmastergt

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineArsquo shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineC shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

Monday April 23 12

PGBouncer abstracts moving DBs from the

app logic

Monday April 23 12

can do this as long as you have more logical shards than physical

ones

Monday April 23 12

beauty of schemas is that they are physically

different files

Monday April 23 12

(no IO hit when deleting no lsquoswiss cheesersquo)

Monday April 23 12

downside requires ~30 seconds of maintenance to roll out new schema

mapping

Monday April 23 12

(could be solved by having concept of ldquoread-

onlyrdquo mode for some DBs)

Monday April 23 12

not great for range-scans that would span across

shards

Monday April 23 12

latest project follow graph

Monday April 23 12

v1 simple DB table(source_id target_id

status)

Monday April 23 12

who do I followwho follows me

do I follow Xdoes X follow me

Monday April 23 12

DB was busy so we started storing parallel

version in Redis

Monday April 23 12

follow_all(300 item list)

Monday April 23 12

inconsistency

Monday April 23 12

extra logic

Monday April 23 12

so much extra logic

Monday April 23 12

exposing your support team to the idea of cache invalidation

Monday April 23 12

Monday April 23 12

redesign took a page from twitterrsquos book

Monday April 23 12

PG can handle tens of thousands of requests very light memcached

caching

Monday April 23 12

next steps

Monday April 23 12

isolating services to minimize open conns

Monday April 23 12

investigate physical hardware etc to reduce

need to re-shard

Monday April 23 12

Wrap up

Monday April 23 12

you donrsquot need to give up PGrsquos durability amp features to shard

Monday April 23 12

continue to let the DB do what the DB is great at

Monday April 23 12

ldquodonrsquot shard until you have tordquo

Monday April 23 12

(but donrsquot over-estimate how hard it will be either)

Monday April 23 12

scaled within constraints of the cloud

Monday April 23 12

PG success story

Monday April 23 12

(wersquore really excited about 92)

Monday April 23 12

thanks any qs

Monday April 23 12

Monday April 23 12

less powerful than my MacBook Pro

Monday April 23 12

okay we launchednow what

Monday April 23 12

25k signups in the first day

Monday April 23 12

everything is on fire

Monday April 23 12

best amp worst day of our lives so far

Monday April 23 12

load was through the roof

Monday April 23 12

friday rolls around

Monday April 23 12

not slowing down

Monday April 23 12

letrsquos move to EC2

Monday April 23 12

Monday April 23 12

PG upgrade to 90

Monday April 23 12

Monday April 23 12

scaling = replacing all components of a car

while driving it at 100mph

Monday April 23 12

this is the story of how our usage of PG has

evolved

Monday April 23 12

Phase 1 All ORM all the time

Monday April 23 12

why pg at first postgis

Monday April 23 12

managepy syncdb

Monday April 23 12

ORM made it too easy to not really think through

primary keys

Monday April 23 12

pretty good for getting off the ground

Monday April 23 12

Mediaobjectsget(pk = 4)

Monday April 23 12

first version of our feed (pre-launch)

Monday April 23 12

friends = Relationshipsobjectsfilter(source_user = user)

recent_photos = Mediaobjectsfilter(user_id__in = friends)order_by(lsquo-pkrsquo)[020]

Monday April 23 12

main feed at launch

Monday April 23 12

Redis user 33 postsfriends = SMEMBERS followers33for user in friends LPUSH feedltuser_idgt ltmedia_idgt

for readingLRANGE feed4 0 20

Monday April 23 12

canonical data PGfeedslistssets Redis

object cache memcache

Monday April 23 12

post-launch

Monday April 23 12

moved db to its own machine

Monday April 23 12

at time largest table photo metadata

Monday April 23 12

ran master-slave from the beginning with

streaming replication

Monday April 23 12

backups stop the replica xfs_freeze drives and take EBS snapshot

Monday April 23 12

AWS tradeoff

Monday April 23 12

3 early problems we hit with PG

Monday April 23 12

1 oh that setting was the problem

Monday April 23 12

work_mem

Monday April 23 12

shared_buffers

Monday April 23 12

cost_delay

Monday April 23 12

2 Django-specific ltidle in transactiongt

Monday April 23 12

3 connection pooling

Monday April 23 12

(we use PGBouncer)

Monday April 23 12

somewhere in this crazy couple of months

Christophe to the rescue

Monday April 23 12

photos kept growing and growing

Monday April 23 12

and only 68GB of RAM on biggest machine in EC2

Monday April 23 12

so what now

Monday April 23 12

Phase 2 Vertical Partitioning

Monday April 23 12

django db routers make it pretty easy

Monday April 23 12

def db_for_read(self model) if app_label == photos return photodb

Monday April 23 12

once you untangle all your foreign key

relationships

Monday April 23 12

(all of those useruser_id interchangeable calls bite

you now)

Monday April 23 12

plenty of time spent in PGFouine

Monday April 23 12

read slaves (using streaming replicas) where

we need to reduce contention

Monday April 23 12

a few months later

Monday April 23 12

photosdb gt 60GB

Monday April 23 12

precipitated by being on cloud hardware but likely to have hit limit eventually

either way

Monday April 23 12

what now

Monday April 23 12

horizontal partitioning

Monday April 23 12

Phase 3 sharding

Monday April 23 12

ldquosurely wersquoll have hired someone experienced before we actually need

to shardrdquo

Monday April 23 12

never true about scaling

Monday April 23 12

1 choosing a method2 adapting the application

3 expanding capacity

Monday April 23 12

evaluated solutions

Monday April 23 12

at the time none were up to task of being our

primary DB

Monday April 23 12

NoSQL alternatives

Monday April 23 12

Skypersquos sharding proxy

Monday April 23 12

Rangedate-based partitioning

Monday April 23 12

did in Postgres itself

Monday April 23 12

requirements

Monday April 23 12

1 low operational amp code complexity

Monday April 23 12

2 easy expanding of capacity

Monday April 23 12

3 low performance impact on application

Monday April 23 12

schema-based logical sharding

Monday April 23 12

many many many (thousands) of logical

shards

Monday April 23 12

that map to fewer physical ones

Monday April 23 12

8 logical shards on 2 machines

user_id 8 = logical shard

logical shards -gt physical shard map

0 A 1 A 2 A 3 A 4 B 5 B 6 B 7 B

Monday April 23 12

8 logical shards on 2 4 machines

user_id 8 = logical shard

logical shards -gt physical shard map

0 A 1 A 2 C 3 C 4 B 5 B 6 D 7 D

Monday April 23 12

iexclschemas

Monday April 23 12

all that lsquopublicrsquo stuff Irsquod been glossing over for 2

years

Monday April 23 12

- database - schema - table - columns

Monday April 23 12

spun up set of machines

Monday April 23 12

using fabric created thousands of schemas

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineB shard4 photos_by_user shard5 photos_by_user shard6 photos_by_user shard7 photos_by_user

Monday April 23 12

(fabric or similar parallel task executor is

essential)

Monday April 23 12

application-side logic

Monday April 23 12

SHARD_TO_DB =

SHARD_TO_DB[0] = 0SHARD_TO_DB[1] = 0SHARD_TO_DB[2] = 0SHARD_TO_DB[3] = 0SHARD_TO_DB[4] = 1SHARD_TO_DB[5] = 1SHARD_TO_DB[6] = 1SHARD_TO_DB[7] = 1

Monday April 23 12

instead of Django ORM wrote really simple db

abstraction layer

Monday April 23 12

selectupdateinsertdelete

Monday April 23 12

select(fields table_name shard_key where_statement where_parameters)

Monday April 23 12

select(fields table_name shard_key where_statement where_parameters)

shard_key num_logical_shards = shard_id

Monday April 23 12

in most cases user_id for us

Monday April 23 12

custom Django test runner to createtear-down sharded DBs

Monday April 23 12

most queries involve visiting handful of shards

over one or two machines

Monday April 23 12

if mapping across shards on single DB UNION

ALL to aggregate

Monday April 23 12

clients to library pass in((shard_key id)

(shard_keyid)) etc

Monday April 23 12

library maps sub-selects to each shard and each

machine

Monday April 23 12

parallel execution (per-machine at least)

Monday April 23 12

-gt Append (cost=00097372 rows=100 width=12) (actual time=0290160035 rows=30 loops=1) -gt Limit (cost=00080624 rows=30 width=12) (actual time=0288159913 rows=14 loops=1) -gt Index Scan Backward using index on table (cost=0001865104 rows=694 width=12) (actual time=0286159885 rows=14 loops=1) -gt Limit (cost=0007115 rows=30 width=12) (actual time=00150018 rows=1 loops=1) -gt Index Scan using index on table (cost=00010199 rows=43 width=12) (actual time=00130014 rows=1 loops=1) (etc)

Monday April 23 12

eventually would be nice to parallelize across

machines

Monday April 23 12

next challenge unique IDs

Monday April 23 12

requirements

Monday April 23 12

1 should be time sortable without requiring

a lookup

Monday April 23 12

2 should be 64-bit

Monday April 23 12

3 low operational complexity

Monday April 23 12

surveyed the options

Monday April 23 12

ticket servers

Monday April 23 12

UUID

Monday April 23 12

twitter snowflake

Monday April 23 12

application-level IDs ala Mongo

Monday April 23 12

hey the db is already pretty good about

incrementing sequences

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

CREATE OR REPLACE FUNCTION insta5next_id(OUT result bigint) AS $$DECLARE our_epoch bigint = 1314220021721 seq_id bigint now_millis bigint shard_id int = 5BEGIN SELECT nextval(insta5table_id_seq) 1024 INTO seq_id

SELECT FLOOR(EXTRACT(EPOCH FROM clock_timestamp()) 1000) INTO now_millis result = (now_millis - our_epoch) ltlt 23 result = result | (shard_id ltlt 10) result = result | (seq_id)END$$ LANGUAGE PLPGSQL

Monday April 23 12

pulling shard ID from ID

shard_id = id ^ ((id gtgt 23) ltlt 23)timestamp = EPOCH + id gtgt 23

Monday April 23 12

pros guaranteed unique in 64-bits not much of a

CPU overhead

Monday April 23 12

cons large IDs from the get-go

Monday April 23 12

hundreds of millions of IDs generated with this

scheme no issues

Monday April 23 12

well what about ldquore-shardingrdquo

Monday April 23 12

first recourse pg_reorg

Monday April 23 12

rewrites tables in index order

Monday April 23 12

only requires brief locks for atomic table renames

Monday April 23 12

20+GB savings on some of our dbs

Monday April 23 12

Monday April 23 12

especially useful on EC2

Monday April 23 12

but sometimes you just have to reshard

Monday April 23 12

streaming replication to the rescue

Monday April 23 12

(btw repmgr is awesome)

Monday April 23 12

repmgr standby clone ltmastergt

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineArsquo shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineC shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

Monday April 23 12

PGBouncer abstracts moving DBs from the

app logic

Monday April 23 12

can do this as long as you have more logical shards than physical

ones

Monday April 23 12

beauty of schemas is that they are physically

different files

Monday April 23 12

(no IO hit when deleting no lsquoswiss cheesersquo)

Monday April 23 12

downside requires ~30 seconds of maintenance to roll out new schema

mapping

Monday April 23 12

(could be solved by having concept of ldquoread-

onlyrdquo mode for some DBs)

Monday April 23 12

not great for range-scans that would span across

shards

Monday April 23 12

latest project follow graph

Monday April 23 12

v1 simple DB table(source_id target_id

status)

Monday April 23 12

who do I followwho follows me

do I follow Xdoes X follow me

Monday April 23 12

DB was busy so we started storing parallel

version in Redis

Monday April 23 12

follow_all(300 item list)

Monday April 23 12

inconsistency

Monday April 23 12

extra logic

Monday April 23 12

so much extra logic

Monday April 23 12

exposing your support team to the idea of cache invalidation

Monday April 23 12

Monday April 23 12

redesign took a page from twitterrsquos book

Monday April 23 12

PG can handle tens of thousands of requests very light memcached

caching

Monday April 23 12

next steps

Monday April 23 12

isolating services to minimize open conns

Monday April 23 12

investigate physical hardware etc to reduce

need to re-shard

Monday April 23 12

Wrap up

Monday April 23 12

you donrsquot need to give up PGrsquos durability amp features to shard

Monday April 23 12

continue to let the DB do what the DB is great at

Monday April 23 12

ldquodonrsquot shard until you have tordquo

Monday April 23 12

(but donrsquot over-estimate how hard it will be either)

Monday April 23 12

scaled within constraints of the cloud

Monday April 23 12

PG success story

Monday April 23 12

(wersquore really excited about 92)

Monday April 23 12

thanks any qs

Monday April 23 12

less powerful than my MacBook Pro

Monday April 23 12

okay we launchednow what

Monday April 23 12

25k signups in the first day

Monday April 23 12

everything is on fire

Monday April 23 12

best amp worst day of our lives so far

Monday April 23 12

load was through the roof

Monday April 23 12

friday rolls around

Monday April 23 12

not slowing down

Monday April 23 12

letrsquos move to EC2

Monday April 23 12

Monday April 23 12

PG upgrade to 90

Monday April 23 12

Monday April 23 12

scaling = replacing all components of a car

while driving it at 100mph

Monday April 23 12

this is the story of how our usage of PG has

evolved

Monday April 23 12

Phase 1 All ORM all the time

Monday April 23 12

why pg at first postgis

Monday April 23 12

managepy syncdb

Monday April 23 12

ORM made it too easy to not really think through

primary keys

Monday April 23 12

pretty good for getting off the ground

Monday April 23 12

Mediaobjectsget(pk = 4)

Monday April 23 12

first version of our feed (pre-launch)

Monday April 23 12

friends = Relationshipsobjectsfilter(source_user = user)

recent_photos = Mediaobjectsfilter(user_id__in = friends)order_by(lsquo-pkrsquo)[020]

Monday April 23 12

main feed at launch

Monday April 23 12

Redis user 33 postsfriends = SMEMBERS followers33for user in friends LPUSH feedltuser_idgt ltmedia_idgt

for readingLRANGE feed4 0 20

Monday April 23 12

canonical data PGfeedslistssets Redis

object cache memcache

Monday April 23 12

post-launch

Monday April 23 12

moved db to its own machine

Monday April 23 12

at time largest table photo metadata

Monday April 23 12

ran master-slave from the beginning with

streaming replication

Monday April 23 12

backups stop the replica xfs_freeze drives and take EBS snapshot

Monday April 23 12

AWS tradeoff

Monday April 23 12

3 early problems we hit with PG

Monday April 23 12

1 oh that setting was the problem

Monday April 23 12

work_mem

Monday April 23 12

shared_buffers

Monday April 23 12

cost_delay

Monday April 23 12

2 Django-specific ltidle in transactiongt

Monday April 23 12

3 connection pooling

Monday April 23 12

(we use PGBouncer)

Monday April 23 12

somewhere in this crazy couple of months

Christophe to the rescue

Monday April 23 12

photos kept growing and growing

Monday April 23 12

and only 68GB of RAM on biggest machine in EC2

Monday April 23 12

so what now

Monday April 23 12

Phase 2 Vertical Partitioning

Monday April 23 12

django db routers make it pretty easy

Monday April 23 12

def db_for_read(self model) if app_label == photos return photodb

Monday April 23 12

once you untangle all your foreign key

relationships

Monday April 23 12

(all of those useruser_id interchangeable calls bite

you now)

Monday April 23 12

plenty of time spent in PGFouine

Monday April 23 12

read slaves (using streaming replicas) where

we need to reduce contention

Monday April 23 12

a few months later

Monday April 23 12

photosdb gt 60GB

Monday April 23 12

precipitated by being on cloud hardware but likely to have hit limit eventually

either way

Monday April 23 12

what now

Monday April 23 12

horizontal partitioning

Monday April 23 12

Phase 3 sharding

Monday April 23 12

ldquosurely wersquoll have hired someone experienced before we actually need

to shardrdquo

Monday April 23 12

never true about scaling

Monday April 23 12

1 choosing a method2 adapting the application

3 expanding capacity

Monday April 23 12

evaluated solutions

Monday April 23 12

at the time none were up to task of being our

primary DB

Monday April 23 12

NoSQL alternatives

Monday April 23 12

Skypersquos sharding proxy

Monday April 23 12

Rangedate-based partitioning

Monday April 23 12

did in Postgres itself

Monday April 23 12

requirements

Monday April 23 12

1 low operational amp code complexity

Monday April 23 12

2 easy expanding of capacity

Monday April 23 12

3 low performance impact on application

Monday April 23 12

schema-based logical sharding

Monday April 23 12

many many many (thousands) of logical

shards

Monday April 23 12

that map to fewer physical ones

Monday April 23 12

8 logical shards on 2 machines

user_id 8 = logical shard

logical shards -gt physical shard map

0 A 1 A 2 A 3 A 4 B 5 B 6 B 7 B

Monday April 23 12

8 logical shards on 2 4 machines

user_id 8 = logical shard

logical shards -gt physical shard map

0 A 1 A 2 C 3 C 4 B 5 B 6 D 7 D

Monday April 23 12

iexclschemas

Monday April 23 12

all that lsquopublicrsquo stuff Irsquod been glossing over for 2

years

Monday April 23 12

- database - schema - table - columns

Monday April 23 12

spun up set of machines

Monday April 23 12

using fabric created thousands of schemas

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineB shard4 photos_by_user shard5 photos_by_user shard6 photos_by_user shard7 photos_by_user

Monday April 23 12

(fabric or similar parallel task executor is

essential)

Monday April 23 12

application-side logic

Monday April 23 12

SHARD_TO_DB =

SHARD_TO_DB[0] = 0SHARD_TO_DB[1] = 0SHARD_TO_DB[2] = 0SHARD_TO_DB[3] = 0SHARD_TO_DB[4] = 1SHARD_TO_DB[5] = 1SHARD_TO_DB[6] = 1SHARD_TO_DB[7] = 1

Monday April 23 12

instead of Django ORM wrote really simple db

abstraction layer

Monday April 23 12

selectupdateinsertdelete

Monday April 23 12

select(fields table_name shard_key where_statement where_parameters)

Monday April 23 12

select(fields table_name shard_key where_statement where_parameters)

shard_key num_logical_shards = shard_id

Monday April 23 12

in most cases user_id for us

Monday April 23 12

custom Django test runner to createtear-down sharded DBs

Monday April 23 12

most queries involve visiting handful of shards

over one or two machines

Monday April 23 12

if mapping across shards on single DB UNION

ALL to aggregate

Monday April 23 12

clients to library pass in((shard_key id)

(shard_keyid)) etc

Monday April 23 12

library maps sub-selects to each shard and each

machine

Monday April 23 12

parallel execution (per-machine at least)

Monday April 23 12

-gt Append (cost=00097372 rows=100 width=12) (actual time=0290160035 rows=30 loops=1) -gt Limit (cost=00080624 rows=30 width=12) (actual time=0288159913 rows=14 loops=1) -gt Index Scan Backward using index on table (cost=0001865104 rows=694 width=12) (actual time=0286159885 rows=14 loops=1) -gt Limit (cost=0007115 rows=30 width=12) (actual time=00150018 rows=1 loops=1) -gt Index Scan using index on table (cost=00010199 rows=43 width=12) (actual time=00130014 rows=1 loops=1) (etc)

Monday April 23 12

eventually would be nice to parallelize across

machines

Monday April 23 12

next challenge unique IDs

Monday April 23 12

requirements

Monday April 23 12

1 should be time sortable without requiring

a lookup

Monday April 23 12

2 should be 64-bit

Monday April 23 12

3 low operational complexity

Monday April 23 12

surveyed the options

Monday April 23 12

ticket servers

Monday April 23 12

UUID

Monday April 23 12

twitter snowflake

Monday April 23 12

application-level IDs ala Mongo

Monday April 23 12

hey the db is already pretty good about

incrementing sequences

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

CREATE OR REPLACE FUNCTION insta5next_id(OUT result bigint) AS $$DECLARE our_epoch bigint = 1314220021721 seq_id bigint now_millis bigint shard_id int = 5BEGIN SELECT nextval(insta5table_id_seq) 1024 INTO seq_id

SELECT FLOOR(EXTRACT(EPOCH FROM clock_timestamp()) 1000) INTO now_millis result = (now_millis - our_epoch) ltlt 23 result = result | (shard_id ltlt 10) result = result | (seq_id)END$$ LANGUAGE PLPGSQL

Monday April 23 12

pulling shard ID from ID

shard_id = id ^ ((id gtgt 23) ltlt 23)timestamp = EPOCH + id gtgt 23

Monday April 23 12

pros guaranteed unique in 64-bits not much of a

CPU overhead

Monday April 23 12

cons large IDs from the get-go

Monday April 23 12

hundreds of millions of IDs generated with this

scheme no issues

Monday April 23 12

well what about ldquore-shardingrdquo

Monday April 23 12

first recourse pg_reorg

Monday April 23 12

rewrites tables in index order

Monday April 23 12

only requires brief locks for atomic table renames

Monday April 23 12

20+GB savings on some of our dbs

Monday April 23 12

Monday April 23 12

especially useful on EC2

Monday April 23 12

but sometimes you just have to reshard

Monday April 23 12

streaming replication to the rescue

Monday April 23 12

(btw repmgr is awesome)

Monday April 23 12

repmgr standby clone ltmastergt

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineArsquo shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineC shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

Monday April 23 12

PGBouncer abstracts moving DBs from the

app logic

Monday April 23 12

can do this as long as you have more logical shards than physical

ones

Monday April 23 12
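(sketch of what that remap looks like from the application's point of view, reusing the hypothetical SHARD_TO_DB map from above: no rows move at the application level, two entries just point at the freshly promoted replica)

# before: 8 logical shards, 2 physical machines
SHARD_TO_DB = {0: "machineA", 1: "machineA", 2: "machineA", 3: "machineA",
               4: "machineB", 5: "machineB", 6: "machineB", 7: "machineB"}

# after cloning machineA -> machineC via streaming replication and repointing
# PGBouncer, half of machineA's logical shards are served by the new box
SHARD_TO_DB.update({2: "machineC", 3: "machineC"})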

beauty of schemas is that they are physically

different files

Monday April 23 12

(no I/O hit when deleting, no 'swiss cheese')

Monday April 23 12

downside: requires ~30 seconds of maintenance to roll out a new schema mapping

Monday April 23 12

(could be solved by having a concept of "read-only" mode for some DBs)

Monday April 23 12

not great for range-scans that would span across

shards

Monday April 23 12

latest project: follow graph

Monday April 23 12

v1: simple DB table (source_id, target_id, status)

Monday April 23 12

who do I follow? who follows me?

do I follow X? does X follow me?

Monday April 23 12
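(a hypothetical sketch of those four lookups with a DB-API cursor; the table name and anything beyond the (source_id, target_id, status) triple on the slide are assumptions)

def following(cur, user_id):
    # who do I follow?
    cur.execute("SELECT target_id FROM follows WHERE source_id = %s", (user_id,))
    return [r[0] for r in cur.fetchall()]

def followers(cur, user_id):
    # who follows me?
    cur.execute("SELECT source_id FROM follows WHERE target_id = %s", (user_id,))
    return [r[0] for r in cur.fetchall()]

def is_following(cur, source_id, target_id):
    # do I follow X? (swap the arguments for: does X follow me?)
    cur.execute("SELECT 1 FROM follows WHERE source_id = %s AND target_id = %s",
                (source_id, target_id))
    return cur.fetchone() is not None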

DB was busy so we started storing parallel

version in Redis

Monday April 23 12

follow_all(300 item list)

Monday April 23 12

inconsistency

Monday April 23 12

extra logic

Monday April 23 12

so much extra logic

Monday April 23 12

exposing your support team to the idea of cache invalidation

Monday April 23 12


redesign took a page from twitter's book

Monday April 23 12

PG can handle tens of thousands of requests, with very light memcached caching

Monday April 23 12
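(a minimal read-through sketch, assuming the python-memcached client and a caller-supplied loader that runs the real PG query; key names and TTL are made up)

import memcache

mc = memcache.Client(["127.0.0.1:11211"])

def cached_followers(user_id, loader, ttl=30):
    # try memcached first; on a miss, let PG do the real work and cache briefly
    key = "followers:%d" % user_id
    rows = mc.get(key)
    if rows is None:
        rows = loader(user_id)
        mc.set(key, rows, time=ttl)
    return rows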

next steps

Monday April 23 12

isolating services to minimize open conns

Monday April 23 12

investigate physical hardware etc to reduce

need to re-shard

Monday April 23 12

Wrap up

Monday April 23 12

you don't need to give up PG's durability & features to shard

Monday April 23 12

continue to let the DB do what the DB is great at

Monday April 23 12

"don't shard until you have to"

Monday April 23 12

(but don't over-estimate how hard it will be either)

Monday April 23 12

scaled within constraints of the cloud

Monday April 23 12

PG success story

Monday April 23 12

(we're really excited about 9.2)

Monday April 23 12

thanks! any Qs?

Monday April 23 12

okay we launchednow what

Monday April 23 12

25k signups in the first day

Monday April 23 12

everything is on fire

Monday April 23 12

best amp worst day of our lives so far

Monday April 23 12

load was through the roof

Monday April 23 12

friday rolls around

Monday April 23 12

not slowing down

Monday April 23 12

letrsquos move to EC2

Monday April 23 12

Monday April 23 12

PG upgrade to 90

Monday April 23 12

Monday April 23 12

scaling = replacing all components of a car

while driving it at 100mph

Monday April 23 12

this is the story of how our usage of PG has

evolved

Monday April 23 12

Phase 1 All ORM all the time

Monday April 23 12

why pg at first postgis

Monday April 23 12

managepy syncdb

Monday April 23 12

ORM made it too easy to not really think through

primary keys

Monday April 23 12

pretty good for getting off the ground

Monday April 23 12

Mediaobjectsget(pk = 4)

Monday April 23 12

first version of our feed (pre-launch)

Monday April 23 12

friends = Relationshipsobjectsfilter(source_user = user)

recent_photos = Mediaobjectsfilter(user_id__in = friends)order_by(lsquo-pkrsquo)[020]

Monday April 23 12

main feed at launch

Monday April 23 12

Redis user 33 postsfriends = SMEMBERS followers33for user in friends LPUSH feedltuser_idgt ltmedia_idgt

for readingLRANGE feed4 0 20

Monday April 23 12

canonical data PGfeedslistssets Redis

object cache memcache

Monday April 23 12

post-launch

Monday April 23 12

moved db to its own machine

Monday April 23 12

at time largest table photo metadata

Monday April 23 12

ran master-slave from the beginning with

streaming replication

Monday April 23 12

backups stop the replica xfs_freeze drives and take EBS snapshot

Monday April 23 12

AWS tradeoff

Monday April 23 12

3 early problems we hit with PG

Monday April 23 12

1 oh that setting was the problem

Monday April 23 12

work_mem

Monday April 23 12

shared_buffers

Monday April 23 12

cost_delay

Monday April 23 12

2 Django-specific ltidle in transactiongt

Monday April 23 12

3 connection pooling

Monday April 23 12

(we use PGBouncer)

Monday April 23 12

somewhere in this crazy couple of months

Christophe to the rescue

Monday April 23 12

photos kept growing and growing

Monday April 23 12

and only 68GB of RAM on biggest machine in EC2

Monday April 23 12

so what now

Monday April 23 12

Phase 2 Vertical Partitioning

Monday April 23 12

django db routers make it pretty easy

Monday April 23 12

def db_for_read(self model) if app_label == photos return photodb

Monday April 23 12

once you untangle all your foreign key

relationships

Monday April 23 12

(all of those useruser_id interchangeable calls bite

you now)

Monday April 23 12

plenty of time spent in PGFouine

Monday April 23 12

read slaves (using streaming replicas) where

we need to reduce contention

Monday April 23 12

a few months later

Monday April 23 12

photosdb gt 60GB

Monday April 23 12

precipitated by being on cloud hardware but likely to have hit limit eventually

either way

Monday April 23 12

what now

Monday April 23 12

horizontal partitioning

Monday April 23 12

Phase 3 sharding

Monday April 23 12

ldquosurely wersquoll have hired someone experienced before we actually need

to shardrdquo

Monday April 23 12

never true about scaling

Monday April 23 12

1 choosing a method2 adapting the application

3 expanding capacity

Monday April 23 12

evaluated solutions

Monday April 23 12

at the time none were up to task of being our

primary DB

Monday April 23 12

NoSQL alternatives

Monday April 23 12

Skypersquos sharding proxy

Monday April 23 12

Rangedate-based partitioning

Monday April 23 12

did in Postgres itself

Monday April 23 12

requirements

Monday April 23 12

1 low operational amp code complexity

Monday April 23 12

2 easy expanding of capacity

Monday April 23 12

3 low performance impact on application

Monday April 23 12

schema-based logical sharding

Monday April 23 12

many many many (thousands) of logical

shards

Monday April 23 12

that map to fewer physical ones

Monday April 23 12

8 logical shards on 2 machines

user_id 8 = logical shard

logical shards -gt physical shard map

0 A 1 A 2 A 3 A 4 B 5 B 6 B 7 B

Monday April 23 12

8 logical shards on 2 4 machines

user_id 8 = logical shard

logical shards -gt physical shard map

0 A 1 A 2 C 3 C 4 B 5 B 6 D 7 D

Monday April 23 12

iexclschemas

Monday April 23 12

all that lsquopublicrsquo stuff Irsquod been glossing over for 2

years

Monday April 23 12

- database - schema - table - columns

Monday April 23 12

spun up set of machines

Monday April 23 12

using fabric created thousands of schemas

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineB shard4 photos_by_user shard5 photos_by_user shard6 photos_by_user shard7 photos_by_user

Monday April 23 12

(fabric or similar parallel task executor is

essential)

Monday April 23 12

application-side logic

Monday April 23 12

SHARD_TO_DB =

SHARD_TO_DB[0] = 0SHARD_TO_DB[1] = 0SHARD_TO_DB[2] = 0SHARD_TO_DB[3] = 0SHARD_TO_DB[4] = 1SHARD_TO_DB[5] = 1SHARD_TO_DB[6] = 1SHARD_TO_DB[7] = 1

Monday April 23 12

instead of Django ORM wrote really simple db

abstraction layer

Monday April 23 12

selectupdateinsertdelete

Monday April 23 12

select(fields table_name shard_key where_statement where_parameters)

Monday April 23 12

select(fields table_name shard_key where_statement where_parameters)

shard_key num_logical_shards = shard_id

Monday April 23 12

in most cases user_id for us

Monday April 23 12

custom Django test runner to createtear-down sharded DBs

Monday April 23 12

most queries involve visiting handful of shards

over one or two machines

Monday April 23 12

if mapping across shards on single DB UNION

ALL to aggregate

Monday April 23 12

clients to library pass in((shard_key id)

(shard_keyid)) etc

Monday April 23 12

library maps sub-selects to each shard and each

machine

Monday April 23 12

parallel execution (per-machine at least)

Monday April 23 12

-gt Append (cost=00097372 rows=100 width=12) (actual time=0290160035 rows=30 loops=1) -gt Limit (cost=00080624 rows=30 width=12) (actual time=0288159913 rows=14 loops=1) -gt Index Scan Backward using index on table (cost=0001865104 rows=694 width=12) (actual time=0286159885 rows=14 loops=1) -gt Limit (cost=0007115 rows=30 width=12) (actual time=00150018 rows=1 loops=1) -gt Index Scan using index on table (cost=00010199 rows=43 width=12) (actual time=00130014 rows=1 loops=1) (etc)

Monday April 23 12

eventually would be nice to parallelize across

machines

Monday April 23 12

next challenge unique IDs

Monday April 23 12

requirements

Monday April 23 12

1 should be time sortable without requiring

a lookup

Monday April 23 12

2 should be 64-bit

Monday April 23 12

3 low operational complexity

Monday April 23 12

surveyed the options

Monday April 23 12

ticket servers

Monday April 23 12

UUID

Monday April 23 12

twitter snowflake

Monday April 23 12

application-level IDs ala Mongo

Monday April 23 12

hey the db is already pretty good about

incrementing sequences

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

CREATE OR REPLACE FUNCTION insta5next_id(OUT result bigint) AS $$DECLARE our_epoch bigint = 1314220021721 seq_id bigint now_millis bigint shard_id int = 5BEGIN SELECT nextval(insta5table_id_seq) 1024 INTO seq_id

SELECT FLOOR(EXTRACT(EPOCH FROM clock_timestamp()) 1000) INTO now_millis result = (now_millis - our_epoch) ltlt 23 result = result | (shard_id ltlt 10) result = result | (seq_id)END$$ LANGUAGE PLPGSQL

Monday April 23 12

pulling shard ID from ID

shard_id = id ^ ((id gtgt 23) ltlt 23)timestamp = EPOCH + id gtgt 23

Monday April 23 12

pros guaranteed unique in 64-bits not much of a

CPU overhead

Monday April 23 12

cons large IDs from the get-go

Monday April 23 12

hundreds of millions of IDs generated with this

scheme no issues

Monday April 23 12

well what about ldquore-shardingrdquo

Monday April 23 12

first recourse pg_reorg

Monday April 23 12

rewrites tables in index order

Monday April 23 12

only requires brief locks for atomic table renames

Monday April 23 12

20+GB savings on some of our dbs

Monday April 23 12

Monday April 23 12

especially useful on EC2

Monday April 23 12

but sometimes you just have to reshard

Monday April 23 12

streaming replication to the rescue

Monday April 23 12

(btw repmgr is awesome)

Monday April 23 12

repmgr standby clone ltmastergt

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineArsquo shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineC shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

Monday April 23 12

PGBouncer abstracts moving DBs from the

app logic

Monday April 23 12

can do this as long as you have more logical shards than physical

ones

Monday April 23 12

beauty of schemas is that they are physically

different files

Monday April 23 12

(no IO hit when deleting no lsquoswiss cheesersquo)

Monday April 23 12

downside requires ~30 seconds of maintenance to roll out new schema

mapping

Monday April 23 12

(could be solved by having concept of ldquoread-

onlyrdquo mode for some DBs)

Monday April 23 12

not great for range-scans that would span across

shards

Monday April 23 12

latest project follow graph

Monday April 23 12

v1 simple DB table(source_id target_id

status)

Monday April 23 12

who do I followwho follows me

do I follow Xdoes X follow me

Monday April 23 12

DB was busy so we started storing parallel

version in Redis

Monday April 23 12

follow_all(300 item list)

Monday April 23 12

inconsistency

Monday April 23 12

extra logic

Monday April 23 12

so much extra logic

Monday April 23 12

exposing your support team to the idea of cache invalidation

Monday April 23 12

Monday April 23 12

redesign took a page from twitterrsquos book

Monday April 23 12

PG can handle tens of thousands of requests very light memcached

caching

Monday April 23 12

next steps

Monday April 23 12

isolating services to minimize open conns

Monday April 23 12

investigate physical hardware etc to reduce

need to re-shard

Monday April 23 12

Wrap up

Monday April 23 12

you donrsquot need to give up PGrsquos durability amp features to shard

Monday April 23 12

continue to let the DB do what the DB is great at

Monday April 23 12

ldquodonrsquot shard until you have tordquo

Monday April 23 12

(but donrsquot over-estimate how hard it will be either)

Monday April 23 12

scaled within constraints of the cloud

Monday April 23 12

PG success story

Monday April 23 12

(wersquore really excited about 92)

Monday April 23 12

thanks any qs

Monday April 23 12

25k signups in the first day

Monday April 23 12

everything is on fire

Monday April 23 12

best amp worst day of our lives so far

Monday April 23 12

load was through the roof

Monday April 23 12

friday rolls around

Monday April 23 12

not slowing down

Monday April 23 12

letrsquos move to EC2

Monday April 23 12

Monday April 23 12

PG upgrade to 90

Monday April 23 12

Monday April 23 12

scaling = replacing all components of a car

while driving it at 100mph

Monday April 23 12

this is the story of how our usage of PG has

evolved

Monday April 23 12

Phase 1 All ORM all the time

Monday April 23 12

why pg at first postgis

Monday April 23 12

managepy syncdb

Monday April 23 12

ORM made it too easy to not really think through

primary keys

Monday April 23 12

pretty good for getting off the ground

Monday April 23 12

Mediaobjectsget(pk = 4)

Monday April 23 12

first version of our feed (pre-launch)

Monday April 23 12

friends = Relationshipsobjectsfilter(source_user = user)

recent_photos = Mediaobjectsfilter(user_id__in = friends)order_by(lsquo-pkrsquo)[020]

Monday April 23 12

main feed at launch

Monday April 23 12

Redis user 33 postsfriends = SMEMBERS followers33for user in friends LPUSH feedltuser_idgt ltmedia_idgt

for readingLRANGE feed4 0 20

Monday April 23 12

canonical data PGfeedslistssets Redis

object cache memcache

Monday April 23 12

post-launch

Monday April 23 12

moved db to its own machine

Monday April 23 12

at time largest table photo metadata

Monday April 23 12

ran master-slave from the beginning with

streaming replication

Monday April 23 12

backups stop the replica xfs_freeze drives and take EBS snapshot

Monday April 23 12

AWS tradeoff

Monday April 23 12

3 early problems we hit with PG

Monday April 23 12

1 oh that setting was the problem

Monday April 23 12

work_mem

Monday April 23 12

shared_buffers

Monday April 23 12

cost_delay

Monday April 23 12

2 Django-specific ltidle in transactiongt

Monday April 23 12

3 connection pooling

Monday April 23 12

(we use PGBouncer)

Monday April 23 12

somewhere in this crazy couple of months

Christophe to the rescue

Monday April 23 12

photos kept growing and growing

Monday April 23 12

and only 68GB of RAM on biggest machine in EC2

Monday April 23 12

so what now

Monday April 23 12

Phase 2 Vertical Partitioning

Monday April 23 12

django db routers make it pretty easy

Monday April 23 12

def db_for_read(self model) if app_label == photos return photodb

Monday April 23 12

once you untangle all your foreign key

relationships

Monday April 23 12

(all of those useruser_id interchangeable calls bite

you now)

Monday April 23 12

plenty of time spent in PGFouine

Monday April 23 12

read slaves (using streaming replicas) where

we need to reduce contention

Monday April 23 12

a few months later

Monday April 23 12

photosdb gt 60GB

Monday April 23 12

precipitated by being on cloud hardware but likely to have hit limit eventually

either way

Monday April 23 12

what now

Monday April 23 12

horizontal partitioning

Monday April 23 12

Phase 3 sharding

Monday April 23 12

ldquosurely wersquoll have hired someone experienced before we actually need

to shardrdquo

Monday April 23 12

never true about scaling

Monday April 23 12

1 choosing a method2 adapting the application

3 expanding capacity

Monday April 23 12

evaluated solutions

Monday April 23 12

at the time none were up to task of being our

primary DB

Monday April 23 12

NoSQL alternatives

Monday April 23 12

Skypersquos sharding proxy

Monday April 23 12

Rangedate-based partitioning

Monday April 23 12

did in Postgres itself

Monday April 23 12

requirements

Monday April 23 12

1 low operational amp code complexity

Monday April 23 12

2 easy expanding of capacity

Monday April 23 12

3 low performance impact on application

Monday April 23 12

schema-based logical sharding

Monday April 23 12

many many many (thousands) of logical

shards

Monday April 23 12

that map to fewer physical ones

Monday April 23 12

8 logical shards on 2 machines

user_id 8 = logical shard

logical shards -gt physical shard map

0 A 1 A 2 A 3 A 4 B 5 B 6 B 7 B

Monday April 23 12

8 logical shards on 2 4 machines

user_id 8 = logical shard

logical shards -gt physical shard map

0 A 1 A 2 C 3 C 4 B 5 B 6 D 7 D

Monday April 23 12

iexclschemas

Monday April 23 12

all that lsquopublicrsquo stuff Irsquod been glossing over for 2

years

Monday April 23 12

- database - schema - table - columns

Monday April 23 12

spun up set of machines

Monday April 23 12

using fabric created thousands of schemas

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineB shard4 photos_by_user shard5 photos_by_user shard6 photos_by_user shard7 photos_by_user

Monday April 23 12

(fabric or similar parallel task executor is

essential)

Monday April 23 12

application-side logic

Monday April 23 12

SHARD_TO_DB =

SHARD_TO_DB[0] = 0SHARD_TO_DB[1] = 0SHARD_TO_DB[2] = 0SHARD_TO_DB[3] = 0SHARD_TO_DB[4] = 1SHARD_TO_DB[5] = 1SHARD_TO_DB[6] = 1SHARD_TO_DB[7] = 1

Monday April 23 12

instead of Django ORM wrote really simple db

abstraction layer

Monday April 23 12

selectupdateinsertdelete

Monday April 23 12

select(fields table_name shard_key where_statement where_parameters)

Monday April 23 12

select(fields table_name shard_key where_statement where_parameters)

shard_key num_logical_shards = shard_id

Monday April 23 12

in most cases user_id for us

Monday April 23 12

custom Django test runner to createtear-down sharded DBs

Monday April 23 12

most queries involve visiting handful of shards

over one or two machines

Monday April 23 12

if mapping across shards on single DB UNION

ALL to aggregate

Monday April 23 12

clients to library pass in((shard_key id)

(shard_keyid)) etc

Monday April 23 12

library maps sub-selects to each shard and each

machine

Monday April 23 12

parallel execution (per-machine at least)

Monday April 23 12

-gt Append (cost=00097372 rows=100 width=12) (actual time=0290160035 rows=30 loops=1) -gt Limit (cost=00080624 rows=30 width=12) (actual time=0288159913 rows=14 loops=1) -gt Index Scan Backward using index on table (cost=0001865104 rows=694 width=12) (actual time=0286159885 rows=14 loops=1) -gt Limit (cost=0007115 rows=30 width=12) (actual time=00150018 rows=1 loops=1) -gt Index Scan using index on table (cost=00010199 rows=43 width=12) (actual time=00130014 rows=1 loops=1) (etc)

Monday April 23 12

eventually would be nice to parallelize across

machines

Monday April 23 12

next challenge unique IDs

Monday April 23 12

requirements

Monday April 23 12

1 should be time sortable without requiring

a lookup

Monday April 23 12

2 should be 64-bit

Monday April 23 12

3 low operational complexity

Monday April 23 12

surveyed the options

Monday April 23 12

ticket servers

Monday April 23 12

UUID

Monday April 23 12

twitter snowflake

Monday April 23 12

application-level IDs ala Mongo

Monday April 23 12

hey the db is already pretty good about

incrementing sequences

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

CREATE OR REPLACE FUNCTION insta5next_id(OUT result bigint) AS $$DECLARE our_epoch bigint = 1314220021721 seq_id bigint now_millis bigint shard_id int = 5BEGIN SELECT nextval(insta5table_id_seq) 1024 INTO seq_id

SELECT FLOOR(EXTRACT(EPOCH FROM clock_timestamp()) 1000) INTO now_millis result = (now_millis - our_epoch) ltlt 23 result = result | (shard_id ltlt 10) result = result | (seq_id)END$$ LANGUAGE PLPGSQL

Monday April 23 12

pulling shard ID from ID

shard_id = id ^ ((id gtgt 23) ltlt 23)timestamp = EPOCH + id gtgt 23

Monday April 23 12

pros guaranteed unique in 64-bits not much of a

CPU overhead

Monday April 23 12

cons large IDs from the get-go

Monday April 23 12

hundreds of millions of IDs generated with this

scheme no issues

Monday April 23 12

well what about ldquore-shardingrdquo

Monday April 23 12

first recourse pg_reorg

Monday April 23 12

rewrites tables in index order

Monday April 23 12

only requires brief locks for atomic table renames

Monday April 23 12

20+GB savings on some of our dbs

Monday April 23 12

Monday April 23 12

especially useful on EC2

Monday April 23 12

but sometimes you just have to reshard

Monday April 23 12

streaming replication to the rescue

Monday April 23 12

(btw repmgr is awesome)

Monday April 23 12

repmgr standby clone ltmastergt

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineArsquo shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineC shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

Monday April 23 12

PGBouncer abstracts moving DBs from the

app logic

Monday April 23 12

can do this as long as you have more logical shards than physical

ones

Monday April 23 12

beauty of schemas is that they are physically

different files

Monday April 23 12

(no IO hit when deleting no lsquoswiss cheesersquo)

Monday April 23 12

downside requires ~30 seconds of maintenance to roll out new schema

mapping

Monday April 23 12

(could be solved by having concept of ldquoread-

onlyrdquo mode for some DBs)

Monday April 23 12

not great for range-scans that would span across

shards

Monday April 23 12

latest project follow graph

Monday April 23 12

v1 simple DB table(source_id target_id

status)

Monday April 23 12

who do I followwho follows me

do I follow Xdoes X follow me

Monday April 23 12

DB was busy so we started storing parallel

version in Redis

Monday April 23 12

follow_all(300 item list)

Monday April 23 12

inconsistency

Monday April 23 12

extra logic

Monday April 23 12

so much extra logic

Monday April 23 12

exposing your support team to the idea of cache invalidation

Monday April 23 12

Monday April 23 12

redesign took a page from twitterrsquos book

Monday April 23 12

PG can handle tens of thousands of requests very light memcached

caching

Monday April 23 12

next steps

Monday April 23 12

isolating services to minimize open conns

Monday April 23 12

investigate physical hardware etc to reduce

need to re-shard

Monday April 23 12

Wrap up

Monday April 23 12

you donrsquot need to give up PGrsquos durability amp features to shard

Monday April 23 12

continue to let the DB do what the DB is great at

Monday April 23 12

ldquodonrsquot shard until you have tordquo

Monday April 23 12

(but donrsquot over-estimate how hard it will be either)

Monday April 23 12

scaled within constraints of the cloud

Monday April 23 12

PG success story

Monday April 23 12

(wersquore really excited about 92)

Monday April 23 12

thanks any qs

Monday April 23 12

everything is on fire

Monday April 23 12

best amp worst day of our lives so far

Monday April 23 12

load was through the roof

Monday April 23 12

friday rolls around

Monday April 23 12

not slowing down

Monday April 23 12

letrsquos move to EC2

Monday April 23 12

Monday April 23 12

PG upgrade to 90

Monday April 23 12

Monday April 23 12

scaling = replacing all components of a car

while driving it at 100mph

Monday April 23 12

this is the story of how our usage of PG has

evolved

Monday April 23 12

Phase 1 All ORM all the time

Monday April 23 12

why pg at first postgis

Monday April 23 12

managepy syncdb

Monday April 23 12

ORM made it too easy to not really think through

primary keys

Monday April 23 12

pretty good for getting off the ground

Monday April 23 12

Mediaobjectsget(pk = 4)

Monday April 23 12

first version of our feed (pre-launch)

Monday April 23 12

friends = Relationshipsobjectsfilter(source_user = user)

recent_photos = Mediaobjectsfilter(user_id__in = friends)order_by(lsquo-pkrsquo)[020]

Monday April 23 12

main feed at launch

Monday April 23 12

Redis user 33 postsfriends = SMEMBERS followers33for user in friends LPUSH feedltuser_idgt ltmedia_idgt

for readingLRANGE feed4 0 20

Monday April 23 12

canonical data PGfeedslistssets Redis

object cache memcache

Monday April 23 12

post-launch

Monday April 23 12

moved db to its own machine

Monday April 23 12

at time largest table photo metadata

Monday April 23 12

ran master-slave from the beginning with

streaming replication

Monday April 23 12

backups stop the replica xfs_freeze drives and take EBS snapshot

Monday April 23 12

AWS tradeoff

Monday April 23 12

3 early problems we hit with PG

Monday April 23 12

1 oh that setting was the problem

Monday April 23 12

work_mem

Monday April 23 12

shared_buffers

Monday April 23 12

cost_delay

Monday April 23 12

2 Django-specific ltidle in transactiongt

Monday April 23 12

3 connection pooling

Monday April 23 12

(we use PGBouncer)

Monday April 23 12

somewhere in this crazy couple of months

Christophe to the rescue

Monday April 23 12

photos kept growing and growing

Monday April 23 12

and only 68GB of RAM on biggest machine in EC2

Monday April 23 12

so what now

Monday April 23 12

Phase 2 Vertical Partitioning

Monday April 23 12

django db routers make it pretty easy

Monday April 23 12

def db_for_read(self model) if app_label == photos return photodb

Monday April 23 12

once you untangle all your foreign key

relationships

Monday April 23 12

(all of those useruser_id interchangeable calls bite

you now)

Monday April 23 12

plenty of time spent in PGFouine

Monday April 23 12

read slaves (using streaming replicas) where

we need to reduce contention

Monday April 23 12

a few months later

Monday April 23 12

photosdb gt 60GB

Monday April 23 12

precipitated by being on cloud hardware but likely to have hit limit eventually

either way

Monday April 23 12

what now

Monday April 23 12

horizontal partitioning

Monday April 23 12

Phase 3 sharding

Monday April 23 12

ldquosurely wersquoll have hired someone experienced before we actually need

to shardrdquo

Monday April 23 12

never true about scaling

Monday April 23 12

1 choosing a method2 adapting the application

3 expanding capacity

Monday April 23 12

evaluated solutions

Monday April 23 12

at the time none were up to task of being our

primary DB

Monday April 23 12

NoSQL alternatives

Monday April 23 12

Skypersquos sharding proxy

Monday April 23 12

Rangedate-based partitioning

Monday April 23 12

did in Postgres itself

Monday April 23 12

requirements

Monday April 23 12

1 low operational amp code complexity

Monday April 23 12

2 easy expanding of capacity

Monday April 23 12

3 low performance impact on application

Monday April 23 12

schema-based logical sharding

Monday April 23 12

many many many (thousands) of logical

shards

Monday April 23 12

that map to fewer physical ones

Monday April 23 12

8 logical shards on 2 machines

user_id 8 = logical shard

logical shards -gt physical shard map

0 A 1 A 2 A 3 A 4 B 5 B 6 B 7 B

Monday April 23 12

8 logical shards on 2 4 machines

user_id 8 = logical shard

logical shards -gt physical shard map

0 A 1 A 2 C 3 C 4 B 5 B 6 D 7 D

Monday April 23 12

iexclschemas

Monday April 23 12

all that lsquopublicrsquo stuff Irsquod been glossing over for 2

years

Monday April 23 12

- database - schema - table - columns

Monday April 23 12

spun up set of machines

Monday April 23 12

using fabric created thousands of schemas

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineB shard4 photos_by_user shard5 photos_by_user shard6 photos_by_user shard7 photos_by_user

Monday April 23 12

(fabric or similar parallel task executor is

essential)

Monday April 23 12

application-side logic

Monday April 23 12

SHARD_TO_DB =

SHARD_TO_DB[0] = 0SHARD_TO_DB[1] = 0SHARD_TO_DB[2] = 0SHARD_TO_DB[3] = 0SHARD_TO_DB[4] = 1SHARD_TO_DB[5] = 1SHARD_TO_DB[6] = 1SHARD_TO_DB[7] = 1

Monday April 23 12

instead of Django ORM wrote really simple db

abstraction layer

Monday April 23 12

selectupdateinsertdelete

Monday April 23 12

select(fields table_name shard_key where_statement where_parameters)

Monday April 23 12

select(fields table_name shard_key where_statement where_parameters)

shard_key num_logical_shards = shard_id

Monday April 23 12

in most cases user_id for us

Monday April 23 12

custom Django test runner to createtear-down sharded DBs

Monday April 23 12

most queries involve visiting handful of shards

over one or two machines

Monday April 23 12

if mapping across shards on single DB UNION

ALL to aggregate

Monday April 23 12

clients to library pass in((shard_key id)

(shard_keyid)) etc

Monday April 23 12

library maps sub-selects to each shard and each

machine

Monday April 23 12

parallel execution (per-machine at least)

Monday April 23 12

-gt Append (cost=00097372 rows=100 width=12) (actual time=0290160035 rows=30 loops=1) -gt Limit (cost=00080624 rows=30 width=12) (actual time=0288159913 rows=14 loops=1) -gt Index Scan Backward using index on table (cost=0001865104 rows=694 width=12) (actual time=0286159885 rows=14 loops=1) -gt Limit (cost=0007115 rows=30 width=12) (actual time=00150018 rows=1 loops=1) -gt Index Scan using index on table (cost=00010199 rows=43 width=12) (actual time=00130014 rows=1 loops=1) (etc)

Monday April 23 12

eventually would be nice to parallelize across

machines

Monday April 23 12

next challenge unique IDs

Monday April 23 12

requirements

Monday April 23 12

1 should be time sortable without requiring

a lookup

Monday April 23 12

2 should be 64-bit

Monday April 23 12

3 low operational complexity

Monday April 23 12

surveyed the options

Monday April 23 12

ticket servers

Monday April 23 12

UUID

Monday April 23 12

twitter snowflake

Monday April 23 12

application-level IDs ala Mongo

Monday April 23 12

hey the db is already pretty good about

incrementing sequences

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

CREATE OR REPLACE FUNCTION insta5next_id(OUT result bigint) AS $$DECLARE our_epoch bigint = 1314220021721 seq_id bigint now_millis bigint shard_id int = 5BEGIN SELECT nextval(insta5table_id_seq) 1024 INTO seq_id

SELECT FLOOR(EXTRACT(EPOCH FROM clock_timestamp()) 1000) INTO now_millis result = (now_millis - our_epoch) ltlt 23 result = result | (shard_id ltlt 10) result = result | (seq_id)END$$ LANGUAGE PLPGSQL

Monday April 23 12

pulling shard ID from ID

shard_id = id ^ ((id gtgt 23) ltlt 23)timestamp = EPOCH + id gtgt 23

Monday April 23 12

pros guaranteed unique in 64-bits not much of a

CPU overhead

Monday April 23 12

cons large IDs from the get-go

Monday April 23 12

hundreds of millions of IDs generated with this

scheme no issues

Monday April 23 12

well what about ldquore-shardingrdquo

Monday April 23 12

first recourse pg_reorg

Monday April 23 12

rewrites tables in index order

Monday April 23 12

only requires brief locks for atomic table renames

Monday April 23 12

20+GB savings on some of our dbs

Monday April 23 12

Monday April 23 12

especially useful on EC2

Monday April 23 12

but sometimes you just have to reshard

Monday April 23 12

streaming replication to the rescue

Monday April 23 12

(btw repmgr is awesome)

Monday April 23 12

repmgr standby clone ltmastergt

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineArsquo shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineC shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

Monday April 23 12

PGBouncer abstracts moving DBs from the

app logic

Monday April 23 12

can do this as long as you have more logical shards than physical

ones

Monday April 23 12

beauty of schemas is that they are physically

different files

Monday April 23 12

(no IO hit when deleting no lsquoswiss cheesersquo)

Monday April 23 12

downside requires ~30 seconds of maintenance to roll out new schema

mapping

Monday April 23 12

(could be solved by having concept of ldquoread-

onlyrdquo mode for some DBs)

Monday April 23 12

not great for range-scans that would span across

shards

Monday April 23 12

latest project follow graph

Monday April 23 12

v1 simple DB table(source_id target_id

status)

Monday April 23 12

who do I followwho follows me

do I follow Xdoes X follow me

Monday April 23 12

DB was busy so we started storing parallel

version in Redis

Monday April 23 12

follow_all(300 item list)

Monday April 23 12

inconsistency

Monday April 23 12

extra logic

Monday April 23 12

so much extra logic

Monday April 23 12

exposing your support team to the idea of cache invalidation

Monday April 23 12

Monday April 23 12

redesign took a page from twitterrsquos book

Monday April 23 12

PG can handle tens of thousands of requests very light memcached

caching

Monday April 23 12

next steps

Monday April 23 12

isolating services to minimize open conns

Monday April 23 12

investigate physical hardware etc to reduce

need to re-shard

Monday April 23 12

Wrap up

Monday April 23 12

you donrsquot need to give up PGrsquos durability amp features to shard

Monday April 23 12

continue to let the DB do what the DB is great at

Monday April 23 12

ldquodonrsquot shard until you have tordquo

Monday April 23 12

(but donrsquot over-estimate how hard it will be either)

Monday April 23 12

scaled within constraints of the cloud

Monday April 23 12

PG success story

Monday April 23 12

(wersquore really excited about 92)

Monday April 23 12

thanks any qs

Monday April 23 12

best amp worst day of our lives so far

Monday April 23 12

load was through the roof

Monday April 23 12

friday rolls around

Monday April 23 12

not slowing down

Monday April 23 12

letrsquos move to EC2

Monday April 23 12

Monday April 23 12

PG upgrade to 90

Monday April 23 12

Monday April 23 12

scaling = replacing all components of a car

while driving it at 100mph

Monday April 23 12

this is the story of how our usage of PG has

evolved

Monday April 23 12

Phase 1 All ORM all the time

Monday April 23 12

why pg at first postgis

Monday April 23 12

managepy syncdb

Monday April 23 12

ORM made it too easy to not really think through

primary keys

Monday April 23 12

pretty good for getting off the ground

Monday April 23 12

Mediaobjectsget(pk = 4)

Monday April 23 12

first version of our feed (pre-launch)

Monday April 23 12

friends = Relationshipsobjectsfilter(source_user = user)

recent_photos = Mediaobjectsfilter(user_id__in = friends)order_by(lsquo-pkrsquo)[020]

Monday April 23 12

main feed at launch

Monday April 23 12

Redis user 33 postsfriends = SMEMBERS followers33for user in friends LPUSH feedltuser_idgt ltmedia_idgt

for readingLRANGE feed4 0 20

Monday April 23 12

canonical data PGfeedslistssets Redis

object cache memcache

Monday April 23 12

post-launch

Monday April 23 12

moved db to its own machine

Monday April 23 12

at time largest table photo metadata

Monday April 23 12

ran master-slave from the beginning with

streaming replication

Monday April 23 12

backups stop the replica xfs_freeze drives and take EBS snapshot

Monday April 23 12

AWS tradeoff

Monday April 23 12

3 early problems we hit with PG

Monday April 23 12

1 oh that setting was the problem

Monday April 23 12

work_mem

Monday April 23 12

shared_buffers

Monday April 23 12

cost_delay

Monday April 23 12

2 Django-specific ltidle in transactiongt

Monday April 23 12

3 connection pooling

Monday April 23 12

(we use PGBouncer)

Monday April 23 12

somewhere in this crazy couple of months

Christophe to the rescue

Monday April 23 12

photos kept growing and growing

Monday April 23 12

and only 68GB of RAM on biggest machine in EC2

Monday April 23 12

so what now

Monday April 23 12

Phase 2 Vertical Partitioning

Monday April 23 12

django db routers make it pretty easy

Monday April 23 12

def db_for_read(self model) if app_label == photos return photodb

Monday April 23 12

once you untangle all your foreign key

relationships

Monday April 23 12

(all of those useruser_id interchangeable calls bite

you now)

Monday April 23 12

plenty of time spent in PGFouine

Monday April 23 12

read slaves (using streaming replicas) where

we need to reduce contention

Monday April 23 12

a few months later

Monday April 23 12

photosdb gt 60GB

Monday April 23 12

precipitated by being on cloud hardware but likely to have hit limit eventually

either way

Monday April 23 12

what now

Monday April 23 12

horizontal partitioning

Monday April 23 12

Phase 3 sharding

Monday April 23 12

ldquosurely wersquoll have hired someone experienced before we actually need

to shardrdquo

Monday April 23 12

never true about scaling

Monday April 23 12

1 choosing a method2 adapting the application

3 expanding capacity

Monday April 23 12

evaluated solutions

Monday April 23 12

at the time none were up to task of being our

primary DB

Monday April 23 12

NoSQL alternatives

Monday April 23 12

Skypersquos sharding proxy

Monday April 23 12

Rangedate-based partitioning

Monday April 23 12

did in Postgres itself

Monday April 23 12

requirements

Monday April 23 12

1 low operational amp code complexity

Monday April 23 12

2 easy expanding of capacity

Monday April 23 12

3 low performance impact on application

Monday April 23 12

schema-based logical sharding

Monday April 23 12

many many many (thousands) of logical

shards

Monday April 23 12

that map to fewer physical ones

Monday April 23 12

8 logical shards on 2 machines

user_id 8 = logical shard

logical shards -gt physical shard map

0 A 1 A 2 A 3 A 4 B 5 B 6 B 7 B

Monday April 23 12

8 logical shards on 2 4 machines

user_id 8 = logical shard

logical shards -gt physical shard map

0 A 1 A 2 C 3 C 4 B 5 B 6 D 7 D

Monday April 23 12

iexclschemas

Monday April 23 12

all that lsquopublicrsquo stuff Irsquod been glossing over for 2

years

Monday April 23 12

- database - schema - table - columns

Monday April 23 12

spun up set of machines

Monday April 23 12

using fabric created thousands of schemas

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineB shard4 photos_by_user shard5 photos_by_user shard6 photos_by_user shard7 photos_by_user

Monday April 23 12

(fabric or similar parallel task executor is

essential)

Monday April 23 12

application-side logic

Monday April 23 12

SHARD_TO_DB =

SHARD_TO_DB[0] = 0SHARD_TO_DB[1] = 0SHARD_TO_DB[2] = 0SHARD_TO_DB[3] = 0SHARD_TO_DB[4] = 1SHARD_TO_DB[5] = 1SHARD_TO_DB[6] = 1SHARD_TO_DB[7] = 1

Monday April 23 12

instead of Django ORM wrote really simple db

abstraction layer

Monday April 23 12

selectupdateinsertdelete

Monday April 23 12

select(fields table_name shard_key where_statement where_parameters)

Monday April 23 12

select(fields table_name shard_key where_statement where_parameters)

shard_key num_logical_shards = shard_id

Monday April 23 12

in most cases user_id for us

Monday April 23 12

custom Django test runner to createtear-down sharded DBs

Monday April 23 12

most queries involve visiting handful of shards

over one or two machines

Monday April 23 12

if mapping across shards on single DB UNION

ALL to aggregate

Monday April 23 12

clients to library pass in((shard_key id)

(shard_keyid)) etc

Monday April 23 12

library maps sub-selects to each shard and each

machine

Monday April 23 12

parallel execution (per-machine at least)

Monday April 23 12

-gt Append (cost=00097372 rows=100 width=12) (actual time=0290160035 rows=30 loops=1) -gt Limit (cost=00080624 rows=30 width=12) (actual time=0288159913 rows=14 loops=1) -gt Index Scan Backward using index on table (cost=0001865104 rows=694 width=12) (actual time=0286159885 rows=14 loops=1) -gt Limit (cost=0007115 rows=30 width=12) (actual time=00150018 rows=1 loops=1) -gt Index Scan using index on table (cost=00010199 rows=43 width=12) (actual time=00130014 rows=1 loops=1) (etc)

Monday April 23 12

eventually would be nice to parallelize across

machines

Monday April 23 12

next challenge unique IDs

Monday April 23 12

requirements

Monday April 23 12

1 should be time sortable without requiring

a lookup

Monday April 23 12

2 should be 64-bit

Monday April 23 12

3 low operational complexity

Monday April 23 12

surveyed the options

Monday April 23 12

ticket servers

Monday April 23 12

UUID

Monday April 23 12

twitter snowflake

Monday April 23 12

application-level IDs ala Mongo

Monday April 23 12

hey the db is already pretty good about

incrementing sequences

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

CREATE OR REPLACE FUNCTION insta5next_id(OUT result bigint) AS $$DECLARE our_epoch bigint = 1314220021721 seq_id bigint now_millis bigint shard_id int = 5BEGIN SELECT nextval(insta5table_id_seq) 1024 INTO seq_id

SELECT FLOOR(EXTRACT(EPOCH FROM clock_timestamp()) 1000) INTO now_millis result = (now_millis - our_epoch) ltlt 23 result = result | (shard_id ltlt 10) result = result | (seq_id)END$$ LANGUAGE PLPGSQL

Monday April 23 12

pulling shard ID from ID

shard_id = id ^ ((id gtgt 23) ltlt 23)timestamp = EPOCH + id gtgt 23

Monday April 23 12

pros guaranteed unique in 64-bits not much of a

CPU overhead

Monday April 23 12

cons large IDs from the get-go

Monday April 23 12

hundreds of millions of IDs generated with this

scheme no issues

Monday April 23 12

well what about ldquore-shardingrdquo

Monday April 23 12

first recourse pg_reorg

Monday April 23 12

rewrites tables in index order

Monday April 23 12

only requires brief locks for atomic table renames

Monday April 23 12

20+GB savings on some of our dbs

Monday April 23 12

Monday April 23 12

especially useful on EC2

Monday April 23 12

but sometimes you just have to reshard

Monday April 23 12

streaming replication to the rescue

Monday April 23 12

(btw repmgr is awesome)

Monday April 23 12

repmgr standby clone ltmastergt

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineArsquo shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineC shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

Monday April 23 12

PGBouncer abstracts moving DBs from the

app logic

Monday April 23 12

can do this as long as you have more logical shards than physical

ones

Monday April 23 12

beauty of schemas is that they are physically

different files

Monday April 23 12

(no IO hit when deleting no lsquoswiss cheesersquo)

Monday April 23 12

downside requires ~30 seconds of maintenance to roll out new schema

mapping

Monday April 23 12

(could be solved by having concept of ldquoread-

onlyrdquo mode for some DBs)

Monday April 23 12

not great for range-scans that would span across

shards

Monday April 23 12

latest project follow graph

Monday April 23 12

v1 simple DB table(source_id target_id

status)

Monday April 23 12

who do I followwho follows me

do I follow Xdoes X follow me

Monday April 23 12

DB was busy so we started storing parallel

version in Redis

Monday April 23 12

follow_all(300 item list)

Monday April 23 12

inconsistency

Monday April 23 12

extra logic

Monday April 23 12

so much extra logic

Monday April 23 12

exposing your support team to the idea of cache invalidation

Monday April 23 12

Monday April 23 12

redesign took a page from twitterrsquos book

Monday April 23 12

PG can handle tens of thousands of requests very light memcached

caching

Monday April 23 12

next steps

Monday April 23 12

isolating services to minimize open conns

Monday April 23 12

investigate physical hardware etc to reduce

need to re-shard

Monday April 23 12

Wrap up

Monday April 23 12

you donrsquot need to give up PGrsquos durability amp features to shard

Monday April 23 12

continue to let the DB do what the DB is great at

Monday April 23 12

ldquodonrsquot shard until you have tordquo

Monday April 23 12

(but donrsquot over-estimate how hard it will be either)

Monday April 23 12

scaled within constraints of the cloud

Monday April 23 12

PG success story

Monday April 23 12

(wersquore really excited about 92)

Monday April 23 12

thanks any qs

Monday April 23 12

load was through the roof

Monday April 23 12

friday rolls around

Monday April 23 12

not slowing down

Monday April 23 12

letrsquos move to EC2

Monday April 23 12

Monday April 23 12

PG upgrade to 90

Monday April 23 12

Monday April 23 12

scaling = replacing all components of a car

while driving it at 100mph

Monday April 23 12

this is the story of how our usage of PG has

evolved

Monday April 23 12

Phase 1 All ORM all the time

Monday April 23 12

why pg at first postgis

Monday April 23 12

managepy syncdb

Monday April 23 12

ORM made it too easy to not really think through

primary keys

Monday April 23 12

pretty good for getting off the ground

Monday April 23 12

Mediaobjectsget(pk = 4)

Monday April 23 12

first version of our feed (pre-launch)

Monday April 23 12

friends = Relationshipsobjectsfilter(source_user = user)

recent_photos = Mediaobjectsfilter(user_id__in = friends)order_by(lsquo-pkrsquo)[020]

Monday April 23 12

main feed at launch

Monday April 23 12

Redis user 33 postsfriends = SMEMBERS followers33for user in friends LPUSH feedltuser_idgt ltmedia_idgt

for readingLRANGE feed4 0 20

Monday April 23 12

canonical data PGfeedslistssets Redis

object cache memcache

Monday April 23 12

post-launch

Monday April 23 12

moved db to its own machine

Monday April 23 12

at time largest table photo metadata

Monday April 23 12

ran master-slave from the beginning with

streaming replication

Monday April 23 12

backups stop the replica xfs_freeze drives and take EBS snapshot

Monday April 23 12

AWS tradeoff

Monday April 23 12

3 early problems we hit with PG

Monday April 23 12

1 oh that setting was the problem

Monday April 23 12

work_mem

Monday April 23 12

shared_buffers

Monday April 23 12

cost_delay

Monday April 23 12

2 Django-specific ltidle in transactiongt

Monday April 23 12

3 connection pooling

Monday April 23 12

(we use PGBouncer)

Monday April 23 12

somewhere in this crazy couple of months

Christophe to the rescue

Monday April 23 12

photos kept growing and growing

Monday April 23 12

and only 68GB of RAM on biggest machine in EC2

Monday April 23 12

so what now

Monday April 23 12

Phase 2 Vertical Partitioning

Monday April 23 12

django db routers make it pretty easy

Monday April 23 12

def db_for_read(self model) if app_label == photos return photodb

Monday April 23 12

once you untangle all your foreign key

relationships

Monday April 23 12

(all of those useruser_id interchangeable calls bite

you now)

Monday April 23 12

plenty of time spent in PGFouine

Monday April 23 12

read slaves (using streaming replicas) where

we need to reduce contention

Monday April 23 12

a few months later

Monday April 23 12

photosdb gt 60GB

Monday April 23 12

precipitated by being on cloud hardware but likely to have hit limit eventually

either way

Monday April 23 12

what now

Monday April 23 12

horizontal partitioning

Monday April 23 12

Phase 3 sharding

Monday April 23 12

ldquosurely wersquoll have hired someone experienced before we actually need

to shardrdquo

Monday April 23 12

never true about scaling

Monday April 23 12

1 choosing a method2 adapting the application

3 expanding capacity

Monday April 23 12

evaluated solutions

Monday April 23 12

at the time none were up to task of being our

primary DB

Monday April 23 12

NoSQL alternatives

Monday April 23 12

Skypersquos sharding proxy

Monday April 23 12

Rangedate-based partitioning

Monday April 23 12

did in Postgres itself

Monday April 23 12

requirements

Monday April 23 12

1 low operational amp code complexity

Monday April 23 12

2 easy expanding of capacity

Monday April 23 12

3 low performance impact on application

Monday April 23 12

schema-based logical sharding

Monday April 23 12

many many many (thousands) of logical

shards

Monday April 23 12

that map to fewer physical ones

Monday April 23 12

8 logical shards on 2 machines

user_id 8 = logical shard

logical shards -gt physical shard map

0 A 1 A 2 A 3 A 4 B 5 B 6 B 7 B

Monday April 23 12

8 logical shards on 2 4 machines

user_id 8 = logical shard

logical shards -gt physical shard map

0 A 1 A 2 C 3 C 4 B 5 B 6 D 7 D

Monday April 23 12

iexclschemas

Monday April 23 12

all that lsquopublicrsquo stuff Irsquod been glossing over for 2

years

Monday April 23 12

- database - schema - table - columns

Monday April 23 12

spun up set of machines

Monday April 23 12

using fabric created thousands of schemas

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineB shard4 photos_by_user shard5 photos_by_user shard6 photos_by_user shard7 photos_by_user

Monday April 23 12

(fabric or similar parallel task executor is

essential)

Monday April 23 12

application-side logic

Monday April 23 12

SHARD_TO_DB =

SHARD_TO_DB[0] = 0SHARD_TO_DB[1] = 0SHARD_TO_DB[2] = 0SHARD_TO_DB[3] = 0SHARD_TO_DB[4] = 1SHARD_TO_DB[5] = 1SHARD_TO_DB[6] = 1SHARD_TO_DB[7] = 1

Monday April 23 12

instead of Django ORM wrote really simple db

abstraction layer

Monday April 23 12

selectupdateinsertdelete

Monday April 23 12

select(fields table_name shard_key where_statement where_parameters)

Monday April 23 12

select(fields table_name shard_key where_statement where_parameters)

shard_key num_logical_shards = shard_id

Monday April 23 12

in most cases user_id for us

Monday April 23 12

custom Django test runner to createtear-down sharded DBs

Monday April 23 12

most queries involve visiting handful of shards

over one or two machines

Monday April 23 12

if mapping across shards on single DB UNION

ALL to aggregate

Monday April 23 12

clients to library pass in((shard_key id)

(shard_keyid)) etc

Monday April 23 12

library maps sub-selects to each shard and each

machine

Monday April 23 12

parallel execution (per-machine at least)

Monday April 23 12

-gt Append (cost=00097372 rows=100 width=12) (actual time=0290160035 rows=30 loops=1) -gt Limit (cost=00080624 rows=30 width=12) (actual time=0288159913 rows=14 loops=1) -gt Index Scan Backward using index on table (cost=0001865104 rows=694 width=12) (actual time=0286159885 rows=14 loops=1) -gt Limit (cost=0007115 rows=30 width=12) (actual time=00150018 rows=1 loops=1) -gt Index Scan using index on table (cost=00010199 rows=43 width=12) (actual time=00130014 rows=1 loops=1) (etc)

Monday April 23 12

eventually would be nice to parallelize across

machines

Monday April 23 12

next challenge unique IDs

Monday April 23 12

requirements

Monday April 23 12

1 should be time sortable without requiring

a lookup

Monday April 23 12

2 should be 64-bit

Monday April 23 12

3 low operational complexity

Monday April 23 12

surveyed the options

Monday April 23 12

ticket servers

Monday April 23 12

UUID

Monday April 23 12

twitter snowflake

Monday April 23 12

application-level IDs ala Mongo

Monday April 23 12

hey the db is already pretty good about

incrementing sequences

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

CREATE OR REPLACE FUNCTION insta5next_id(OUT result bigint) AS $$DECLARE our_epoch bigint = 1314220021721 seq_id bigint now_millis bigint shard_id int = 5BEGIN SELECT nextval(insta5table_id_seq) 1024 INTO seq_id

SELECT FLOOR(EXTRACT(EPOCH FROM clock_timestamp()) 1000) INTO now_millis result = (now_millis - our_epoch) ltlt 23 result = result | (shard_id ltlt 10) result = result | (seq_id)END$$ LANGUAGE PLPGSQL

Monday April 23 12

pulling shard ID from ID

shard_id = id ^ ((id gtgt 23) ltlt 23)timestamp = EPOCH + id gtgt 23

Monday April 23 12

pros guaranteed unique in 64-bits not much of a

CPU overhead

Monday April 23 12

cons large IDs from the get-go

Monday April 23 12

hundreds of millions of IDs generated with this

scheme no issues

Monday April 23 12

well what about ldquore-shardingrdquo

Monday April 23 12

first recourse pg_reorg

Monday April 23 12

rewrites tables in index order

Monday April 23 12

only requires brief locks for atomic table renames

Monday April 23 12

20+GB savings on some of our dbs

Monday April 23 12

Monday April 23 12

especially useful on EC2

Monday April 23 12

but sometimes you just have to reshard

Monday April 23 12

streaming replication to the rescue

Monday April 23 12

(btw repmgr is awesome)

Monday April 23 12

repmgr standby clone ltmastergt

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineArsquo shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineC shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

Monday April 23 12

PGBouncer abstracts moving DBs from the

app logic

Monday April 23 12

can do this as long as you have more logical shards than physical

ones

Monday April 23 12

beauty of schemas is that they are physically

different files

Monday April 23 12

(no IO hit when deleting no lsquoswiss cheesersquo)

Monday April 23 12

downside requires ~30 seconds of maintenance to roll out new schema

mapping

Monday April 23 12

(could be solved by having concept of "read-only" mode for some DBs)

Monday April 23 12

not great for range-scans that would span across

shards

Monday April 23 12

latest project follow graph

Monday April 23 12

v1: simple DB table (source_id, target_id, status)

Monday April 23 12

who do I follow? who follows me?

do I follow X? does X follow me?

Monday April 23 12
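A hypothetical sketch of those four lookups against the v1 table; the (source_id, target_id, status) columns come from the slide, while the table name, the status value, and the psycopg2-style placeholders are assumptions.

def following(cur, user_id):
    cur.execute("SELECT target_id FROM follow_graph"
                " WHERE source_id = %s AND status = 'active'", (user_id,))
    return [r[0] for r in cur.fetchall()]

def followers(cur, user_id):
    cur.execute("SELECT source_id FROM follow_graph"
                " WHERE target_id = %s AND status = 'active'", (user_id,))
    return [r[0] for r in cur.fetchall()]

def follows(cur, source_id, target_id):
    cur.execute("SELECT 1 FROM follow_graph WHERE source_id = %s"
                " AND target_id = %s AND status = 'active' LIMIT 1",
                (source_id, target_id))
    return cur.fetchone() is not None

def followed_by(cur, source_id, target_id):
    return follows(cur, target_id, source_id)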

DB was busy so we started storing parallel

version in Redis

Monday April 23 12

follow_all(300 item list)

Monday April 23 12

inconsistency

Monday April 23 12

extra logic

Monday April 23 12

so much extra logic

Monday April 23 12

exposing your support team to the idea of cache invalidation

Monday April 23 12

Monday April 23 12

redesign took a page from twitter's book

Monday April 23 12

PG can handle tens of thousands of requests, very light memcached caching

Monday April 23 12
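One plausible shape for that "very light" caching, sketched with python-memcached; the key format, TTL, and query are made up, not Instagram's values. Read through the cache, fall back to Postgres, and let a short expiry do most of the invalidation work.

import memcache

mc = memcache.Client(["127.0.0.1:11211"])

def follower_count_cached(cur, user_id, ttl=30):
    key = "followers:count:%d" % user_id
    count = mc.get(key)
    if count is None:
        cur.execute("SELECT count(*) FROM follow_graph"
                    " WHERE target_id = %s AND status = 'active'", (user_id,))
        count = cur.fetchone()[0]
        mc.set(key, count, time=ttl)
    return count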

next steps

Monday April 23 12

isolating services to minimize open conns

Monday April 23 12

investigate physical hardware etc to reduce

need to re-shard

Monday April 23 12

Wrap up

Monday April 23 12

you don't need to give up PG's durability & features to shard

Monday April 23 12

continue to let the DB do what the DB is great at

Monday April 23 12

"don't shard until you have to"

Monday April 23 12

(but don't over-estimate how hard it will be either)

Monday April 23 12

scaled within constraints of the cloud

Monday April 23 12

PG success story

Monday April 23 12

(we're really excited about 9.2)

Monday April 23 12

thanks! any qs?

Monday April 23 12

friday rolls around

Monday April 23 12

not slowing down

Monday April 23 12

letrsquos move to EC2

Monday April 23 12

Monday April 23 12

PG upgrade to 90

Monday April 23 12

Monday April 23 12

scaling = replacing all components of a car

while driving it at 100mph

Monday April 23 12

this is the story of how our usage of PG has

evolved

Monday April 23 12

Phase 1 All ORM all the time

Monday April 23 12

why pg at first postgis

Monday April 23 12

managepy syncdb

Monday April 23 12

ORM made it too easy to not really think through

primary keys

Monday April 23 12

pretty good for getting off the ground

Monday April 23 12

Mediaobjectsget(pk = 4)

Monday April 23 12

first version of our feed (pre-launch)

Monday April 23 12

friends = Relationshipsobjectsfilter(source_user = user)

recent_photos = Mediaobjectsfilter(user_id__in = friends)order_by(lsquo-pkrsquo)[020]

Monday April 23 12

main feed at launch

Monday April 23 12

Redis user 33 postsfriends = SMEMBERS followers33for user in friends LPUSH feedltuser_idgt ltmedia_idgt

for readingLRANGE feed4 0 20

Monday April 23 12

canonical data PGfeedslistssets Redis

object cache memcache

Monday April 23 12

post-launch

Monday April 23 12

moved db to its own machine

Monday April 23 12

at time largest table photo metadata

Monday April 23 12

ran master-slave from the beginning with

streaming replication

Monday April 23 12

backups stop the replica xfs_freeze drives and take EBS snapshot

Monday April 23 12

AWS tradeoff

Monday April 23 12

3 early problems we hit with PG

Monday April 23 12

1 oh that setting was the problem

Monday April 23 12

work_mem

Monday April 23 12

shared_buffers

Monday April 23 12

cost_delay

Monday April 23 12

2 Django-specific ltidle in transactiongt

Monday April 23 12

3 connection pooling

Monday April 23 12

(we use PGBouncer)

Monday April 23 12

somewhere in this crazy couple of months

Christophe to the rescue

Monday April 23 12

photos kept growing and growing

Monday April 23 12

and only 68GB of RAM on biggest machine in EC2

Monday April 23 12

so what now

Monday April 23 12

Phase 2 Vertical Partitioning

Monday April 23 12

django db routers make it pretty easy

Monday April 23 12

def db_for_read(self model) if app_label == photos return photodb

Monday April 23 12

once you untangle all your foreign key

relationships

Monday April 23 12

(all of those useruser_id interchangeable calls bite

you now)

Monday April 23 12

plenty of time spent in PGFouine

Monday April 23 12

read slaves (using streaming replicas) where

we need to reduce contention

Monday April 23 12

a few months later

Monday April 23 12

photosdb gt 60GB

Monday April 23 12

precipitated by being on cloud hardware but likely to have hit limit eventually

either way

Monday April 23 12

what now

Monday April 23 12

horizontal partitioning

Monday April 23 12

Phase 3 sharding

Monday April 23 12

ldquosurely wersquoll have hired someone experienced before we actually need

to shardrdquo

Monday April 23 12

never true about scaling

Monday April 23 12

1 choosing a method2 adapting the application

3 expanding capacity

Monday April 23 12

evaluated solutions

Monday April 23 12

at the time none were up to task of being our

primary DB

Monday April 23 12

NoSQL alternatives

Monday April 23 12

Skypersquos sharding proxy

Monday April 23 12

Rangedate-based partitioning

Monday April 23 12

did in Postgres itself

Monday April 23 12

requirements

Monday April 23 12

1 low operational amp code complexity

Monday April 23 12

2 easy expanding of capacity

Monday April 23 12

3 low performance impact on application

Monday April 23 12

schema-based logical sharding

Monday April 23 12

many many many (thousands) of logical

shards

Monday April 23 12

that map to fewer physical ones

Monday April 23 12

8 logical shards on 2 machines

user_id 8 = logical shard

logical shards -gt physical shard map

0 A 1 A 2 A 3 A 4 B 5 B 6 B 7 B

Monday April 23 12

8 logical shards on 2 4 machines

user_id 8 = logical shard

logical shards -gt physical shard map

0 A 1 A 2 C 3 C 4 B 5 B 6 D 7 D

Monday April 23 12

iexclschemas

Monday April 23 12

all that lsquopublicrsquo stuff Irsquod been glossing over for 2

years

Monday April 23 12

- database - schema - table - columns

Monday April 23 12

spun up set of machines

Monday April 23 12

using fabric created thousands of schemas

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineB shard4 photos_by_user shard5 photos_by_user shard6 photos_by_user shard7 photos_by_user

Monday April 23 12

(fabric or similar parallel task executor is

essential)

Monday April 23 12

application-side logic

Monday April 23 12

SHARD_TO_DB =

SHARD_TO_DB[0] = 0SHARD_TO_DB[1] = 0SHARD_TO_DB[2] = 0SHARD_TO_DB[3] = 0SHARD_TO_DB[4] = 1SHARD_TO_DB[5] = 1SHARD_TO_DB[6] = 1SHARD_TO_DB[7] = 1

Monday April 23 12

instead of Django ORM wrote really simple db

abstraction layer

Monday April 23 12

selectupdateinsertdelete

Monday April 23 12

select(fields table_name shard_key where_statement where_parameters)

Monday April 23 12

select(fields table_name shard_key where_statement where_parameters)

shard_key num_logical_shards = shard_id

Monday April 23 12

in most cases user_id for us

Monday April 23 12

custom Django test runner to createtear-down sharded DBs

Monday April 23 12

most queries involve visiting handful of shards

over one or two machines

Monday April 23 12

if mapping across shards on single DB UNION

ALL to aggregate

Monday April 23 12

clients to library pass in((shard_key id)

(shard_keyid)) etc

Monday April 23 12

library maps sub-selects to each shard and each

machine

Monday April 23 12

parallel execution (per-machine at least)

Monday April 23 12

-gt Append (cost=00097372 rows=100 width=12) (actual time=0290160035 rows=30 loops=1) -gt Limit (cost=00080624 rows=30 width=12) (actual time=0288159913 rows=14 loops=1) -gt Index Scan Backward using index on table (cost=0001865104 rows=694 width=12) (actual time=0286159885 rows=14 loops=1) -gt Limit (cost=0007115 rows=30 width=12) (actual time=00150018 rows=1 loops=1) -gt Index Scan using index on table (cost=00010199 rows=43 width=12) (actual time=00130014 rows=1 loops=1) (etc)

Monday April 23 12

eventually would be nice to parallelize across

machines

Monday April 23 12

next challenge unique IDs

Monday April 23 12

requirements

Monday April 23 12

1 should be time sortable without requiring

a lookup

Monday April 23 12

2 should be 64-bit

Monday April 23 12

3 low operational complexity

Monday April 23 12

surveyed the options

Monday April 23 12

ticket servers

Monday April 23 12

UUID

Monday April 23 12

twitter snowflake

Monday April 23 12

application-level IDs ala Mongo

Monday April 23 12

hey the db is already pretty good about

incrementing sequences

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

CREATE OR REPLACE FUNCTION insta5next_id(OUT result bigint) AS $$DECLARE our_epoch bigint = 1314220021721 seq_id bigint now_millis bigint shard_id int = 5BEGIN SELECT nextval(insta5table_id_seq) 1024 INTO seq_id

SELECT FLOOR(EXTRACT(EPOCH FROM clock_timestamp()) 1000) INTO now_millis result = (now_millis - our_epoch) ltlt 23 result = result | (shard_id ltlt 10) result = result | (seq_id)END$$ LANGUAGE PLPGSQL

Monday April 23 12

pulling shard ID from ID

shard_id = id ^ ((id gtgt 23) ltlt 23)timestamp = EPOCH + id gtgt 23

Monday April 23 12

pros guaranteed unique in 64-bits not much of a

CPU overhead

Monday April 23 12

cons large IDs from the get-go

Monday April 23 12

hundreds of millions of IDs generated with this

scheme no issues

Monday April 23 12

well what about ldquore-shardingrdquo

Monday April 23 12

first recourse pg_reorg

Monday April 23 12

rewrites tables in index order

Monday April 23 12

only requires brief locks for atomic table renames

Monday April 23 12

20+GB savings on some of our dbs

Monday April 23 12

Monday April 23 12

especially useful on EC2

Monday April 23 12

but sometimes you just have to reshard

Monday April 23 12

streaming replication to the rescue

Monday April 23 12

(btw repmgr is awesome)

Monday April 23 12

repmgr standby clone ltmastergt

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineArsquo shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineC shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

Monday April 23 12

PGBouncer abstracts moving DBs from the

app logic

Monday April 23 12

can do this as long as you have more logical shards than physical

ones

Monday April 23 12

beauty of schemas is that they are physically

different files

Monday April 23 12

(no IO hit when deleting no lsquoswiss cheesersquo)

Monday April 23 12

downside requires ~30 seconds of maintenance to roll out new schema

mapping

Monday April 23 12

(could be solved by having concept of ldquoread-

onlyrdquo mode for some DBs)

Monday April 23 12

not great for range-scans that would span across

shards

Monday April 23 12

latest project follow graph

Monday April 23 12

v1 simple DB table(source_id target_id

status)

Monday April 23 12

who do I followwho follows me

do I follow Xdoes X follow me

Monday April 23 12

DB was busy so we started storing parallel

version in Redis

Monday April 23 12

follow_all(300 item list)

Monday April 23 12

inconsistency

Monday April 23 12

extra logic

Monday April 23 12

so much extra logic

Monday April 23 12

exposing your support team to the idea of cache invalidation

Monday April 23 12

Monday April 23 12

redesign took a page from twitterrsquos book

Monday April 23 12

PG can handle tens of thousands of requests very light memcached

caching

Monday April 23 12

next steps

Monday April 23 12

isolating services to minimize open conns

Monday April 23 12

investigate physical hardware etc to reduce

need to re-shard

Monday April 23 12

Wrap up

Monday April 23 12

you donrsquot need to give up PGrsquos durability amp features to shard

Monday April 23 12

continue to let the DB do what the DB is great at

Monday April 23 12

ldquodonrsquot shard until you have tordquo

Monday April 23 12

(but donrsquot over-estimate how hard it will be either)

Monday April 23 12

scaled within constraints of the cloud

Monday April 23 12

PG success story

Monday April 23 12

(wersquore really excited about 92)

Monday April 23 12

thanks any qs

Monday April 23 12

not slowing down

Monday April 23 12

letrsquos move to EC2

Monday April 23 12

Monday April 23 12

PG upgrade to 90

Monday April 23 12

Monday April 23 12

scaling = replacing all components of a car

while driving it at 100mph

Monday April 23 12

this is the story of how our usage of PG has

evolved

Monday April 23 12

Phase 1 All ORM all the time

Monday April 23 12

why pg at first postgis

Monday April 23 12

managepy syncdb

Monday April 23 12

ORM made it too easy to not really think through

primary keys

Monday April 23 12

pretty good for getting off the ground

Monday April 23 12

Mediaobjectsget(pk = 4)

Monday April 23 12

first version of our feed (pre-launch)

Monday April 23 12

friends = Relationshipsobjectsfilter(source_user = user)

recent_photos = Mediaobjectsfilter(user_id__in = friends)order_by(lsquo-pkrsquo)[020]

Monday April 23 12

main feed at launch

Monday April 23 12

Redis user 33 postsfriends = SMEMBERS followers33for user in friends LPUSH feedltuser_idgt ltmedia_idgt

for readingLRANGE feed4 0 20

Monday April 23 12

canonical data PGfeedslistssets Redis

object cache memcache

Monday April 23 12

post-launch

Monday April 23 12

moved db to its own machine

Monday April 23 12

at time largest table photo metadata

Monday April 23 12

ran master-slave from the beginning with

streaming replication

Monday April 23 12

backups stop the replica xfs_freeze drives and take EBS snapshot

Monday April 23 12

AWS tradeoff

Monday April 23 12

3 early problems we hit with PG

Monday April 23 12

1 oh that setting was the problem

Monday April 23 12

work_mem

Monday April 23 12

shared_buffers

Monday April 23 12

cost_delay

Monday April 23 12

2 Django-specific ltidle in transactiongt

Monday April 23 12

3 connection pooling

Monday April 23 12

(we use PGBouncer)

Monday April 23 12

somewhere in this crazy couple of months

Christophe to the rescue

Monday April 23 12

photos kept growing and growing

Monday April 23 12

and only 68GB of RAM on biggest machine in EC2

Monday April 23 12

so what now

Monday April 23 12

Phase 2 Vertical Partitioning

Monday April 23 12

django db routers make it pretty easy

Monday April 23 12

def db_for_read(self model) if app_label == photos return photodb

Monday April 23 12

once you untangle all your foreign key

relationships

Monday April 23 12

(all of those useruser_id interchangeable calls bite

you now)

Monday April 23 12

plenty of time spent in PGFouine

Monday April 23 12

read slaves (using streaming replicas) where

we need to reduce contention

Monday April 23 12

a few months later

Monday April 23 12

photosdb gt 60GB

Monday April 23 12

precipitated by being on cloud hardware but likely to have hit limit eventually

either way

Monday April 23 12

what now

Monday April 23 12

horizontal partitioning

Monday April 23 12

Phase 3 sharding

Monday April 23 12

ldquosurely wersquoll have hired someone experienced before we actually need

to shardrdquo

Monday April 23 12

never true about scaling

Monday April 23 12

1 choosing a method2 adapting the application

3 expanding capacity

Monday April 23 12

evaluated solutions

Monday April 23 12

at the time none were up to task of being our

primary DB

Monday April 23 12

NoSQL alternatives

Monday April 23 12

Skypersquos sharding proxy

Monday April 23 12

Rangedate-based partitioning

Monday April 23 12

did in Postgres itself

Monday April 23 12

requirements

Monday April 23 12

1 low operational amp code complexity

Monday April 23 12

2 easy expanding of capacity

Monday April 23 12

3 low performance impact on application

Monday April 23 12

schema-based logical sharding

Monday April 23 12

many many many (thousands) of logical

shards

Monday April 23 12

that map to fewer physical ones

Monday April 23 12

8 logical shards on 2 machines

user_id 8 = logical shard

logical shards -gt physical shard map

0 A 1 A 2 A 3 A 4 B 5 B 6 B 7 B

Monday April 23 12

8 logical shards on 2 4 machines

user_id 8 = logical shard

logical shards -gt physical shard map

0 A 1 A 2 C 3 C 4 B 5 B 6 D 7 D

Monday April 23 12

iexclschemas

Monday April 23 12

all that lsquopublicrsquo stuff Irsquod been glossing over for 2

years

Monday April 23 12

- database - schema - table - columns

Monday April 23 12

spun up set of machines

Monday April 23 12

using fabric created thousands of schemas

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineB shard4 photos_by_user shard5 photos_by_user shard6 photos_by_user shard7 photos_by_user

Monday April 23 12

(fabric or similar parallel task executor is

essential)

Monday April 23 12

application-side logic

Monday April 23 12

SHARD_TO_DB =

SHARD_TO_DB[0] = 0SHARD_TO_DB[1] = 0SHARD_TO_DB[2] = 0SHARD_TO_DB[3] = 0SHARD_TO_DB[4] = 1SHARD_TO_DB[5] = 1SHARD_TO_DB[6] = 1SHARD_TO_DB[7] = 1

Monday April 23 12

instead of Django ORM wrote really simple db

abstraction layer

Monday April 23 12

selectupdateinsertdelete

Monday April 23 12

select(fields table_name shard_key where_statement where_parameters)

Monday April 23 12

select(fields table_name shard_key where_statement where_parameters)

shard_key num_logical_shards = shard_id

Monday April 23 12

in most cases user_id for us

Monday April 23 12

custom Django test runner to createtear-down sharded DBs

Monday April 23 12

most queries involve visiting handful of shards

over one or two machines

Monday April 23 12

if mapping across shards on single DB UNION

ALL to aggregate

Monday April 23 12

clients to library pass in((shard_key id)

(shard_keyid)) etc

Monday April 23 12

library maps sub-selects to each shard and each

machine

Monday April 23 12

parallel execution (per-machine at least)

Monday April 23 12

-gt Append (cost=00097372 rows=100 width=12) (actual time=0290160035 rows=30 loops=1) -gt Limit (cost=00080624 rows=30 width=12) (actual time=0288159913 rows=14 loops=1) -gt Index Scan Backward using index on table (cost=0001865104 rows=694 width=12) (actual time=0286159885 rows=14 loops=1) -gt Limit (cost=0007115 rows=30 width=12) (actual time=00150018 rows=1 loops=1) -gt Index Scan using index on table (cost=00010199 rows=43 width=12) (actual time=00130014 rows=1 loops=1) (etc)

Monday April 23 12

eventually would be nice to parallelize across

machines

Monday April 23 12

next challenge unique IDs

Monday April 23 12

requirements

Monday April 23 12

1 should be time sortable without requiring

a lookup

Monday April 23 12

2 should be 64-bit

Monday April 23 12

3 low operational complexity

Monday April 23 12

surveyed the options

Monday April 23 12

ticket servers

Monday April 23 12

UUID

Monday April 23 12

twitter snowflake

Monday April 23 12

application-level IDs ala Mongo

Monday April 23 12

hey the db is already pretty good about

incrementing sequences

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

CREATE OR REPLACE FUNCTION insta5next_id(OUT result bigint) AS $$DECLARE our_epoch bigint = 1314220021721 seq_id bigint now_millis bigint shard_id int = 5BEGIN SELECT nextval(insta5table_id_seq) 1024 INTO seq_id

SELECT FLOOR(EXTRACT(EPOCH FROM clock_timestamp()) 1000) INTO now_millis result = (now_millis - our_epoch) ltlt 23 result = result | (shard_id ltlt 10) result = result | (seq_id)END$$ LANGUAGE PLPGSQL

Monday April 23 12

pulling shard ID from ID

shard_id = id ^ ((id gtgt 23) ltlt 23)timestamp = EPOCH + id gtgt 23

Monday April 23 12

pros guaranteed unique in 64-bits not much of a

CPU overhead

Monday April 23 12

cons large IDs from the get-go

Monday April 23 12

hundreds of millions of IDs generated with this

scheme no issues

Monday April 23 12

well what about ldquore-shardingrdquo

Monday April 23 12

first recourse pg_reorg

Monday April 23 12

rewrites tables in index order

Monday April 23 12

only requires brief locks for atomic table renames

Monday April 23 12

20+GB savings on some of our dbs

Monday April 23 12

Monday April 23 12

especially useful on EC2

Monday April 23 12

but sometimes you just have to reshard

Monday April 23 12

streaming replication to the rescue

Monday April 23 12

(btw repmgr is awesome)

Monday April 23 12

repmgr standby clone ltmastergt

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineArsquo shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineC shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

Monday April 23 12

PGBouncer abstracts moving DBs from the

app logic

Monday April 23 12

can do this as long as you have more logical shards than physical

ones

Monday April 23 12

beauty of schemas is that they are physically

different files

Monday April 23 12

(no IO hit when deleting no lsquoswiss cheesersquo)

Monday April 23 12

downside requires ~30 seconds of maintenance to roll out new schema

mapping

Monday April 23 12

(could be solved by having concept of ldquoread-

onlyrdquo mode for some DBs)

Monday April 23 12

not great for range-scans that would span across

shards

Monday April 23 12

latest project follow graph

Monday April 23 12

v1 simple DB table(source_id target_id

status)

Monday April 23 12

who do I followwho follows me

do I follow Xdoes X follow me

Monday April 23 12

DB was busy so we started storing parallel

version in Redis

Monday April 23 12

follow_all(300 item list)

Monday April 23 12

inconsistency

Monday April 23 12

extra logic

Monday April 23 12

so much extra logic

Monday April 23 12

exposing your support team to the idea of cache invalidation

Monday April 23 12

Monday April 23 12

redesign took a page from twitterrsquos book

Monday April 23 12

PG can handle tens of thousands of requests very light memcached

caching

Monday April 23 12

next steps

Monday April 23 12

isolating services to minimize open conns

Monday April 23 12

investigate physical hardware etc to reduce

need to re-shard

Monday April 23 12

Wrap up

Monday April 23 12

you donrsquot need to give up PGrsquos durability amp features to shard

Monday April 23 12

continue to let the DB do what the DB is great at

Monday April 23 12

ldquodonrsquot shard until you have tordquo

Monday April 23 12

(but donrsquot over-estimate how hard it will be either)

Monday April 23 12

scaled within constraints of the cloud

Monday April 23 12

PG success story

Monday April 23 12

(wersquore really excited about 92)

Monday April 23 12

thanks any qs

Monday April 23 12

letrsquos move to EC2

Monday April 23 12

Monday April 23 12

PG upgrade to 90

Monday April 23 12

Monday April 23 12

scaling = replacing all components of a car

while driving it at 100mph

Monday April 23 12

this is the story of how our usage of PG has

evolved

Monday April 23 12

Phase 1 All ORM all the time

Monday April 23 12

why pg at first postgis

Monday April 23 12

managepy syncdb

Monday April 23 12

ORM made it too easy to not really think through

primary keys

Monday April 23 12

pretty good for getting off the ground

Monday April 23 12

Mediaobjectsget(pk = 4)

Monday April 23 12

first version of our feed (pre-launch)

Monday April 23 12

friends = Relationshipsobjectsfilter(source_user = user)

recent_photos = Mediaobjectsfilter(user_id__in = friends)order_by(lsquo-pkrsquo)[020]

Monday April 23 12

main feed at launch

Monday April 23 12

Redis user 33 postsfriends = SMEMBERS followers33for user in friends LPUSH feedltuser_idgt ltmedia_idgt

for readingLRANGE feed4 0 20

Monday April 23 12

canonical data PGfeedslistssets Redis

object cache memcache

Monday April 23 12

post-launch

Monday April 23 12

moved db to its own machine

Monday April 23 12

at time largest table photo metadata

Monday April 23 12

ran master-slave from the beginning with

streaming replication

Monday April 23 12

backups stop the replica xfs_freeze drives and take EBS snapshot

Monday April 23 12

AWS tradeoff

Monday April 23 12

3 early problems we hit with PG

Monday April 23 12

1 oh that setting was the problem

Monday April 23 12

work_mem

Monday April 23 12

shared_buffers

Monday April 23 12

cost_delay

Monday April 23 12

2 Django-specific ltidle in transactiongt

Monday April 23 12

3 connection pooling

Monday April 23 12

(we use PGBouncer)

Monday April 23 12

somewhere in this crazy couple of months

Christophe to the rescue

Monday April 23 12

photos kept growing and growing

Monday April 23 12

and only 68GB of RAM on biggest machine in EC2

Monday April 23 12

so what now

Monday April 23 12

Phase 2 Vertical Partitioning

Monday April 23 12

django db routers make it pretty easy

Monday April 23 12

def db_for_read(self model) if app_label == photos return photodb

Monday April 23 12

once you untangle all your foreign key

relationships

Monday April 23 12

(all of those useruser_id interchangeable calls bite

you now)

Monday April 23 12

plenty of time spent in PGFouine

Monday April 23 12

read slaves (using streaming replicas) where

we need to reduce contention

Monday April 23 12

a few months later

Monday April 23 12

photosdb gt 60GB

Monday April 23 12

precipitated by being on cloud hardware but likely to have hit limit eventually

either way

Monday April 23 12

what now

Monday April 23 12

horizontal partitioning

Monday April 23 12

Phase 3 sharding

Monday April 23 12

ldquosurely wersquoll have hired someone experienced before we actually need

to shardrdquo

Monday April 23 12

never true about scaling

Monday April 23 12

1 choosing a method2 adapting the application

3 expanding capacity

Monday April 23 12

evaluated solutions

Monday April 23 12

at the time none were up to task of being our

primary DB

Monday April 23 12

NoSQL alternatives

Monday April 23 12

Skypersquos sharding proxy

Monday April 23 12

Rangedate-based partitioning

Monday April 23 12

did in Postgres itself

Monday April 23 12

requirements

Monday April 23 12

1 low operational amp code complexity

Monday April 23 12

2 easy expanding of capacity

Monday April 23 12

3 low performance impact on application

Monday April 23 12

schema-based logical sharding

Monday April 23 12

many many many (thousands) of logical

shards

Monday April 23 12

that map to fewer physical ones

Monday April 23 12

8 logical shards on 2 machines

user_id 8 = logical shard

logical shards -gt physical shard map

0 A 1 A 2 A 3 A 4 B 5 B 6 B 7 B

Monday April 23 12

8 logical shards on 2 4 machines

user_id 8 = logical shard

logical shards -gt physical shard map

0 A 1 A 2 C 3 C 4 B 5 B 6 D 7 D

Monday April 23 12

iexclschemas

Monday April 23 12

all that lsquopublicrsquo stuff Irsquod been glossing over for 2

years

Monday April 23 12

- database - schema - table - columns

Monday April 23 12

spun up set of machines

Monday April 23 12

using fabric created thousands of schemas

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineB shard4 photos_by_user shard5 photos_by_user shard6 photos_by_user shard7 photos_by_user

Monday April 23 12

(fabric or similar parallel task executor is

essential)

Monday April 23 12

application-side logic

Monday April 23 12

SHARD_TO_DB =

SHARD_TO_DB[0] = 0SHARD_TO_DB[1] = 0SHARD_TO_DB[2] = 0SHARD_TO_DB[3] = 0SHARD_TO_DB[4] = 1SHARD_TO_DB[5] = 1SHARD_TO_DB[6] = 1SHARD_TO_DB[7] = 1

Monday April 23 12

instead of Django ORM wrote really simple db

abstraction layer

Monday April 23 12

selectupdateinsertdelete

Monday April 23 12

select(fields table_name shard_key where_statement where_parameters)

Monday April 23 12

select(fields table_name shard_key where_statement where_parameters)

shard_key num_logical_shards = shard_id

Monday April 23 12

in most cases user_id for us

Monday April 23 12

custom Django test runner to createtear-down sharded DBs

Monday April 23 12

most queries involve visiting handful of shards

over one or two machines

Monday April 23 12

if mapping across shards on single DB UNION

ALL to aggregate

Monday April 23 12

clients to library pass in((shard_key id)

(shard_keyid)) etc

Monday April 23 12

library maps sub-selects to each shard and each

machine

Monday April 23 12

parallel execution (per-machine at least)

Monday April 23 12

-gt Append (cost=00097372 rows=100 width=12) (actual time=0290160035 rows=30 loops=1) -gt Limit (cost=00080624 rows=30 width=12) (actual time=0288159913 rows=14 loops=1) -gt Index Scan Backward using index on table (cost=0001865104 rows=694 width=12) (actual time=0286159885 rows=14 loops=1) -gt Limit (cost=0007115 rows=30 width=12) (actual time=00150018 rows=1 loops=1) -gt Index Scan using index on table (cost=00010199 rows=43 width=12) (actual time=00130014 rows=1 loops=1) (etc)

Monday April 23 12

eventually would be nice to parallelize across

machines

Monday April 23 12

next challenge unique IDs

Monday April 23 12

requirements

Monday April 23 12

1 should be time sortable without requiring

a lookup

Monday April 23 12

2 should be 64-bit

Monday April 23 12

3 low operational complexity

Monday April 23 12

surveyed the options

Monday April 23 12

ticket servers

Monday April 23 12

UUID

Monday April 23 12

twitter snowflake

Monday April 23 12

application-level IDs ala Mongo

Monday April 23 12

hey the db is already pretty good about

incrementing sequences

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

CREATE OR REPLACE FUNCTION insta5next_id(OUT result bigint) AS $$DECLARE our_epoch bigint = 1314220021721 seq_id bigint now_millis bigint shard_id int = 5BEGIN SELECT nextval(insta5table_id_seq) 1024 INTO seq_id

SELECT FLOOR(EXTRACT(EPOCH FROM clock_timestamp()) 1000) INTO now_millis result = (now_millis - our_epoch) ltlt 23 result = result | (shard_id ltlt 10) result = result | (seq_id)END$$ LANGUAGE PLPGSQL

Monday April 23 12

pulling shard ID from ID

shard_id = id ^ ((id gtgt 23) ltlt 23)timestamp = EPOCH + id gtgt 23

Monday April 23 12

pros guaranteed unique in 64-bits not much of a

CPU overhead

Monday April 23 12

cons large IDs from the get-go

Monday April 23 12

hundreds of millions of IDs generated with this

scheme no issues

Monday April 23 12

well what about ldquore-shardingrdquo

Monday April 23 12

first recourse pg_reorg

Monday April 23 12

rewrites tables in index order

Monday April 23 12

only requires brief locks for atomic table renames

Monday April 23 12

20+GB savings on some of our dbs

Monday April 23 12

Monday April 23 12

especially useful on EC2

Monday April 23 12

but sometimes you just have to reshard

Monday April 23 12

streaming replication to the rescue

Monday April 23 12

(btw repmgr is awesome)

Monday April 23 12

repmgr standby clone ltmastergt

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineArsquo shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineC shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

Monday April 23 12

PGBouncer abstracts moving DBs from the

app logic

Monday April 23 12

can do this as long as you have more logical shards than physical

ones

Monday April 23 12

beauty of schemas is that they are physically

different files

Monday April 23 12

(no IO hit when deleting no lsquoswiss cheesersquo)

Monday April 23 12

downside requires ~30 seconds of maintenance to roll out new schema

mapping

Monday April 23 12

(could be solved by having concept of ldquoread-

onlyrdquo mode for some DBs)

Monday April 23 12

not great for range-scans that would span across

shards

Monday April 23 12

latest project follow graph

Monday April 23 12

v1 simple DB table(source_id target_id

status)

Monday April 23 12

who do I followwho follows me

do I follow Xdoes X follow me

Monday April 23 12

DB was busy so we started storing parallel

version in Redis

Monday April 23 12

follow_all(300 item list)

Monday April 23 12

inconsistency

Monday April 23 12

extra logic

Monday April 23 12

so much extra logic

Monday April 23 12

exposing your support team to the idea of cache invalidation

Monday April 23 12

Monday April 23 12

redesign took a page from twitterrsquos book

Monday April 23 12

PG can handle tens of thousands of requests very light memcached

caching

Monday April 23 12

next steps

Monday April 23 12

isolating services to minimize open conns

Monday April 23 12

investigate physical hardware etc to reduce

need to re-shard

Monday April 23 12

Wrap up

Monday April 23 12

you donrsquot need to give up PGrsquos durability amp features to shard

Monday April 23 12

continue to let the DB do what the DB is great at

Monday April 23 12

ldquodonrsquot shard until you have tordquo

Monday April 23 12

(but donrsquot over-estimate how hard it will be either)

Monday April 23 12

scaled within constraints of the cloud

Monday April 23 12

PG success story

Monday April 23 12

(wersquore really excited about 92)

Monday April 23 12

thanks any qs

Monday April 23 12

Monday April 23 12

PG upgrade to 90

Monday April 23 12

Monday April 23 12

scaling = replacing all components of a car

while driving it at 100mph

Monday April 23 12

this is the story of how our usage of PG has

evolved

Monday April 23 12

Phase 1 All ORM all the time

Monday April 23 12

why pg at first postgis

Monday April 23 12

managepy syncdb

Monday April 23 12

ORM made it too easy to not really think through

primary keys

Monday April 23 12

pretty good for getting off the ground

Monday April 23 12

Mediaobjectsget(pk = 4)

Monday April 23 12

first version of our feed (pre-launch)

Monday April 23 12

friends = Relationshipsobjectsfilter(source_user = user)

recent_photos = Mediaobjectsfilter(user_id__in = friends)order_by(lsquo-pkrsquo)[020]

Monday April 23 12

main feed at launch

Monday April 23 12

Redis user 33 postsfriends = SMEMBERS followers33for user in friends LPUSH feedltuser_idgt ltmedia_idgt

for readingLRANGE feed4 0 20

Monday April 23 12

canonical data PGfeedslistssets Redis

object cache memcache

Monday April 23 12

post-launch

Monday April 23 12

moved db to its own machine

Monday April 23 12

at time largest table photo metadata

Monday April 23 12

ran master-slave from the beginning with

streaming replication

Monday April 23 12

backups stop the replica xfs_freeze drives and take EBS snapshot

Monday April 23 12

AWS tradeoff

Monday April 23 12

3 early problems we hit with PG

Monday April 23 12

1 oh that setting was the problem

Monday April 23 12

work_mem

Monday April 23 12

shared_buffers

Monday April 23 12

cost_delay

Monday April 23 12

2 Django-specific ltidle in transactiongt

Monday April 23 12

3 connection pooling

Monday April 23 12

(we use PGBouncer)

Monday April 23 12

somewhere in this crazy couple of months

Christophe to the rescue

Monday April 23 12

photos kept growing and growing

Monday April 23 12

and only 68GB of RAM on biggest machine in EC2

Monday April 23 12

so what now

Monday April 23 12

Phase 2 Vertical Partitioning

Monday April 23 12

django db routers make it pretty easy

Monday April 23 12

def db_for_read(self model) if app_label == photos return photodb

Monday April 23 12

once you untangle all your foreign key

relationships

Monday April 23 12

(all of those useruser_id interchangeable calls bite

you now)

Monday April 23 12

plenty of time spent in PGFouine

Monday April 23 12

read slaves (using streaming replicas) where

we need to reduce contention

Monday April 23 12

a few months later

Monday April 23 12

photosdb gt 60GB

Monday April 23 12

precipitated by being on cloud hardware but likely to have hit limit eventually

either way

Monday April 23 12

what now

Monday April 23 12

horizontal partitioning

Monday April 23 12

Phase 3 sharding

Monday April 23 12

ldquosurely wersquoll have hired someone experienced before we actually need

to shardrdquo

Monday April 23 12

never true about scaling

Monday April 23 12

1 choosing a method2 adapting the application

3 expanding capacity

Monday April 23 12

evaluated solutions

Monday April 23 12

at the time none were up to task of being our

primary DB

Monday April 23 12

NoSQL alternatives

Monday April 23 12

Skypersquos sharding proxy

Monday April 23 12

Rangedate-based partitioning

Monday April 23 12

did in Postgres itself

Monday April 23 12

requirements

Monday April 23 12

1 low operational amp code complexity

Monday April 23 12

2 easy expanding of capacity

Monday April 23 12

3 low performance impact on application

Monday April 23 12

schema-based logical sharding

Monday April 23 12

many many many (thousands) of logical

shards

Monday April 23 12

that map to fewer physical ones

Monday April 23 12

8 logical shards on 2 machines

user_id 8 = logical shard

logical shards -gt physical shard map

0 A 1 A 2 A 3 A 4 B 5 B 6 B 7 B

Monday April 23 12

8 logical shards on 2 4 machines

user_id 8 = logical shard

logical shards -gt physical shard map

0 A 1 A 2 C 3 C 4 B 5 B 6 D 7 D

Monday April 23 12

iexclschemas

Monday April 23 12

all that lsquopublicrsquo stuff Irsquod been glossing over for 2

years

Monday April 23 12

- database - schema - table - columns

Monday April 23 12

spun up set of machines

Monday April 23 12

using fabric created thousands of schemas

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineB shard4 photos_by_user shard5 photos_by_user shard6 photos_by_user shard7 photos_by_user

Monday April 23 12

(fabric or similar parallel task executor is

essential)

Monday April 23 12

application-side logic

Monday April 23 12

SHARD_TO_DB =

SHARD_TO_DB[0] = 0SHARD_TO_DB[1] = 0SHARD_TO_DB[2] = 0SHARD_TO_DB[3] = 0SHARD_TO_DB[4] = 1SHARD_TO_DB[5] = 1SHARD_TO_DB[6] = 1SHARD_TO_DB[7] = 1

Monday April 23 12

instead of Django ORM wrote really simple db

abstraction layer

Monday April 23 12

selectupdateinsertdelete

Monday April 23 12

select(fields table_name shard_key where_statement where_parameters)

Monday April 23 12

select(fields table_name shard_key where_statement where_parameters)

shard_key num_logical_shards = shard_id

Monday April 23 12

in most cases user_id for us

Monday April 23 12

custom Django test runner to createtear-down sharded DBs

Monday April 23 12

most queries involve visiting handful of shards

over one or two machines

Monday April 23 12

if mapping across shards on single DB UNION

ALL to aggregate

Monday April 23 12

clients to library pass in((shard_key id)

(shard_keyid)) etc

Monday April 23 12

library maps sub-selects to each shard and each

machine

Monday April 23 12

parallel execution (per-machine at least)

Monday April 23 12

-gt Append (cost=00097372 rows=100 width=12) (actual time=0290160035 rows=30 loops=1) -gt Limit (cost=00080624 rows=30 width=12) (actual time=0288159913 rows=14 loops=1) -gt Index Scan Backward using index on table (cost=0001865104 rows=694 width=12) (actual time=0286159885 rows=14 loops=1) -gt Limit (cost=0007115 rows=30 width=12) (actual time=00150018 rows=1 loops=1) -gt Index Scan using index on table (cost=00010199 rows=43 width=12) (actual time=00130014 rows=1 loops=1) (etc)

Monday April 23 12

eventually would be nice to parallelize across

machines

Monday April 23 12

next challenge unique IDs

Monday April 23 12

requirements

Monday April 23 12

1 should be time sortable without requiring

a lookup

Monday April 23 12

2 should be 64-bit

Monday April 23 12

3 low operational complexity

Monday April 23 12

surveyed the options

Monday April 23 12

ticket servers

Monday April 23 12

UUID

Monday April 23 12

twitter snowflake

Monday April 23 12

application-level IDs ala Mongo

Monday April 23 12

hey the db is already pretty good about

incrementing sequences

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

CREATE OR REPLACE FUNCTION insta5next_id(OUT result bigint) AS $$DECLARE our_epoch bigint = 1314220021721 seq_id bigint now_millis bigint shard_id int = 5BEGIN SELECT nextval(insta5table_id_seq) 1024 INTO seq_id

SELECT FLOOR(EXTRACT(EPOCH FROM clock_timestamp()) 1000) INTO now_millis result = (now_millis - our_epoch) ltlt 23 result = result | (shard_id ltlt 10) result = result | (seq_id)END$$ LANGUAGE PLPGSQL

Monday April 23 12

pulling shard ID from ID

shard_id = id ^ ((id gtgt 23) ltlt 23)timestamp = EPOCH + id gtgt 23

Monday April 23 12

pros guaranteed unique in 64-bits not much of a

CPU overhead

Monday April 23 12

cons large IDs from the get-go

Monday April 23 12

hundreds of millions of IDs generated with this

scheme no issues

Monday April 23 12

well what about ldquore-shardingrdquo

Monday April 23 12

first recourse pg_reorg

Monday April 23 12

rewrites tables in index order

Monday April 23 12

only requires brief locks for atomic table renames

Monday April 23 12

20+GB savings on some of our dbs

Monday April 23 12

Monday April 23 12

especially useful on EC2

Monday April 23 12

but sometimes you just have to reshard

Monday April 23 12

streaming replication to the rescue

Monday April 23 12

(btw repmgr is awesome)

Monday April 23 12

repmgr standby clone ltmastergt

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineArsquo shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineC shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

Monday April 23 12

PGBouncer abstracts moving DBs from the

app logic

Monday April 23 12

can do this as long as you have more logical shards than physical

ones

Monday April 23 12

beauty of schemas is that they are physically

different files

Monday April 23 12

(no IO hit when deleting no lsquoswiss cheesersquo)

Monday April 23 12

downside requires ~30 seconds of maintenance to roll out new schema

mapping

Monday April 23 12

(could be solved by having concept of ldquoread-

onlyrdquo mode for some DBs)

Monday April 23 12

not great for range-scans that would span across

shards

Monday April 23 12

latest project follow graph

Monday April 23 12

v1 simple DB table(source_id target_id

status)

Monday April 23 12

who do I followwho follows me

do I follow Xdoes X follow me

Monday April 23 12

DB was busy so we started storing parallel

version in Redis

Monday April 23 12

follow_all(300 item list)

Monday April 23 12

inconsistency

Monday April 23 12

extra logic

Monday April 23 12

so much extra logic

Monday April 23 12

exposing your support team to the idea of cache invalidation

Monday April 23 12

Monday April 23 12

redesign took a page from twitterrsquos book

Monday April 23 12

PG can handle tens of thousands of requests very light memcached

caching

Monday April 23 12

next steps

Monday April 23 12

isolating services to minimize open conns

Monday April 23 12

investigate physical hardware etc to reduce

need to re-shard

Monday April 23 12

Wrap up

Monday April 23 12

you donrsquot need to give up PGrsquos durability amp features to shard

Monday April 23 12

continue to let the DB do what the DB is great at

Monday April 23 12

ldquodonrsquot shard until you have tordquo

Monday April 23 12

(but donrsquot over-estimate how hard it will be either)

Monday April 23 12

scaled within constraints of the cloud

Monday April 23 12

PG success story

Monday April 23 12

(wersquore really excited about 92)

Monday April 23 12

thanks any qs

Monday April 23 12

PG upgrade to 90

Monday April 23 12

Monday April 23 12

scaling = replacing all components of a car

while driving it at 100mph

Monday April 23 12

this is the story of how our usage of PG has

evolved

Monday April 23 12

Phase 1 All ORM all the time

Monday April 23 12

why pg at first postgis

Monday April 23 12

managepy syncdb

Monday April 23 12

ORM made it too easy to not really think through

primary keys

Monday April 23 12

pretty good for getting off the ground

Monday April 23 12

Mediaobjectsget(pk = 4)

Monday April 23 12

first version of our feed (pre-launch)

Monday April 23 12

friends = Relationshipsobjectsfilter(source_user = user)

recent_photos = Mediaobjectsfilter(user_id__in = friends)order_by(lsquo-pkrsquo)[020]

Monday April 23 12

main feed at launch

Monday April 23 12

Redis user 33 postsfriends = SMEMBERS followers33for user in friends LPUSH feedltuser_idgt ltmedia_idgt

for readingLRANGE feed4 0 20

Monday April 23 12

canonical data PGfeedslistssets Redis

object cache memcache

Monday April 23 12

post-launch

Monday April 23 12

moved db to its own machine

Monday April 23 12

at time largest table photo metadata

Monday April 23 12

ran master-slave from the beginning with

streaming replication

Monday April 23 12

backups stop the replica xfs_freeze drives and take EBS snapshot

Monday April 23 12

AWS tradeoff

Monday April 23 12

3 early problems we hit with PG

Monday April 23 12

1 oh that setting was the problem

Monday April 23 12

work_mem

Monday April 23 12

shared_buffers

Monday April 23 12

cost_delay

Monday April 23 12

2 Django-specific ltidle in transactiongt

Monday April 23 12

3 connection pooling

Monday April 23 12

(we use PGBouncer)

Monday April 23 12

somewhere in this crazy couple of months

Christophe to the rescue

Monday April 23 12

photos kept growing and growing

Monday April 23 12

and only 68GB of RAM on biggest machine in EC2

Monday April 23 12

so what now

Monday April 23 12

Phase 2 Vertical Partitioning

Monday April 23 12

django db routers make it pretty easy

Monday April 23 12

def db_for_read(self model) if app_label == photos return photodb

Monday April 23 12

once you untangle all your foreign key

relationships

Monday April 23 12

(all of those useruser_id interchangeable calls bite

you now)

Monday April 23 12

plenty of time spent in PGFouine

Monday April 23 12

read slaves (using streaming replicas) where

we need to reduce contention

Monday April 23 12

a few months later

Monday April 23 12

photosdb gt 60GB

Monday April 23 12

precipitated by being on cloud hardware but likely to have hit limit eventually

either way

Monday April 23 12

what now

Monday April 23 12

horizontal partitioning

Monday April 23 12

Phase 3 sharding

Monday April 23 12

ldquosurely wersquoll have hired someone experienced before we actually need

to shardrdquo

Monday April 23 12

never true about scaling

Monday April 23 12

1 choosing a method2 adapting the application

3 expanding capacity

Monday April 23 12

evaluated solutions

Monday April 23 12

at the time none were up to task of being our

primary DB

Monday April 23 12

NoSQL alternatives

Monday April 23 12

Skypersquos sharding proxy

Monday April 23 12

Rangedate-based partitioning

Monday April 23 12

did in Postgres itself

Monday April 23 12

requirements

Monday April 23 12

1 low operational amp code complexity

Monday April 23 12

2 easy expanding of capacity

Monday April 23 12

3 low performance impact on application

Monday April 23 12

schema-based logical sharding

Monday April 23 12

many many many (thousands) of logical

shards

Monday April 23 12

that map to fewer physical ones

Monday April 23 12

8 logical shards on 2 machines

user_id 8 = logical shard

logical shards -gt physical shard map

0 A 1 A 2 A 3 A 4 B 5 B 6 B 7 B

Monday April 23 12

8 logical shards on 2 4 machines

user_id 8 = logical shard

logical shards -gt physical shard map

0 A 1 A 2 C 3 C 4 B 5 B 6 D 7 D

Monday April 23 12

iexclschemas

Monday April 23 12

all that lsquopublicrsquo stuff Irsquod been glossing over for 2

years

Monday April 23 12

- database - schema - table - columns

Monday April 23 12

spun up set of machines

Monday April 23 12

using fabric created thousands of schemas

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineB shard4 photos_by_user shard5 photos_by_user shard6 photos_by_user shard7 photos_by_user

Monday April 23 12

(fabric or similar parallel task executor is

essential)

Monday April 23 12

application-side logic

Monday April 23 12

SHARD_TO_DB =

SHARD_TO_DB[0] = 0SHARD_TO_DB[1] = 0SHARD_TO_DB[2] = 0SHARD_TO_DB[3] = 0SHARD_TO_DB[4] = 1SHARD_TO_DB[5] = 1SHARD_TO_DB[6] = 1SHARD_TO_DB[7] = 1

Monday April 23 12

instead of Django ORM wrote really simple db

abstraction layer

Monday April 23 12

selectupdateinsertdelete

Monday April 23 12

select(fields table_name shard_key where_statement where_parameters)

Monday April 23 12

select(fields table_name shard_key where_statement where_parameters)

shard_key num_logical_shards = shard_id

Monday April 23 12

in most cases user_id for us

Monday April 23 12

custom Django test runner to createtear-down sharded DBs

Monday April 23 12

most queries involve visiting handful of shards

over one or two machines

Monday April 23 12

if mapping across shards on single DB UNION

ALL to aggregate

Monday April 23 12

clients to library pass in((shard_key id)

(shard_keyid)) etc

Monday April 23 12

library maps sub-selects to each shard and each

machine

Monday April 23 12

parallel execution (per-machine at least)

Monday April 23 12

-gt Append (cost=00097372 rows=100 width=12) (actual time=0290160035 rows=30 loops=1) -gt Limit (cost=00080624 rows=30 width=12) (actual time=0288159913 rows=14 loops=1) -gt Index Scan Backward using index on table (cost=0001865104 rows=694 width=12) (actual time=0286159885 rows=14 loops=1) -gt Limit (cost=0007115 rows=30 width=12) (actual time=00150018 rows=1 loops=1) -gt Index Scan using index on table (cost=00010199 rows=43 width=12) (actual time=00130014 rows=1 loops=1) (etc)

Monday April 23 12

eventually would be nice to parallelize across

machines

Monday April 23 12

next challenge unique IDs

Monday April 23 12

requirements

Monday April 23 12

1 should be time sortable without requiring

a lookup

Monday April 23 12

2 should be 64-bit

Monday April 23 12

3 low operational complexity

Monday April 23 12

surveyed the options

Monday April 23 12

ticket servers

Monday April 23 12

UUID

Monday April 23 12

twitter snowflake

Monday April 23 12

application-level IDs ala Mongo

Monday April 23 12

hey the db is already pretty good about

incrementing sequences

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12


photos kept growing and growing

and only 68GB of RAM on the biggest machine in EC2

so, what now?

Phase 2: Vertical Partitioning

django db routers make it pretty easy

def db_for_read(self, model):
    if model._meta.app_label == 'photos':
        return 'photodb'
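for context, a minimal sketch of a full router class; 'photodb' and the photos app label come from the slide, while the class name, the 'default' alias, and the allow_syncdb rule are assumptions:

# register via DATABASE_ROUTERS = ['path.to.PhotoRouter'] in settings.py
class PhotoRouter(object):
    """Send the photos app to its own database, everything else to 'default'."""

    def db_for_read(self, model, **hints):
        if model._meta.app_label == 'photos':
            return 'photodb'
        return 'default'

    def db_for_write(self, model, **hints):
        if model._meta.app_label == 'photos':
            return 'photodb'
        return 'default'

    def allow_syncdb(self, db, model):
        # keep photo tables out of 'default' and vice versa
        if db == 'photodb':
            return model._meta.app_label == 'photos'
        return model._meta.app_label != 'photos'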

once you untangle all your foreign key relationships

(all of those user/user_id interchangeable calls bite you now)

plenty of time spent in PGFouine

read slaves (using streaming replicas) where we need to reduce contention

a few months later

photosdb > 60GB

precipitated by being on cloud hardware, but likely to have hit the limit eventually either way

what now?

horizontal partitioning

Phase 3: sharding

"surely we'll have hired someone experienced before we actually need to shard"

never true about scaling

1. choosing a method
2. adapting the application
3. expanding capacity

evaluated solutions

at the time, none were up to the task of being our primary DB

NoSQL alternatives

Skype's sharding proxy

Range/date-based partitioning

did it in Postgres itself

requirements:

1. low operational & code complexity
2. easy expanding of capacity
3. low performance impact on application

schema-based logical sharding

many, many, many (thousands) of logical shards

that map to fewer physical ones

8 logical shards on 2 machines

user_id % 8 = logical shard

logical shards -> physical shard map:
  0: A, 1: A, 2: A, 3: A,
  4: B, 5: B, 6: B, 7: B

8 logical shards on 2 -> 4 machines

user_id % 8 = logical shard

logical shards -> physical shard map:
  0: A, 1: A, 2: C, 3: C,
  4: B, 5: B, 6: D, 7: D
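a tiny sketch of that lookup in application terms (LOGICAL_TO_PHYSICAL and machine_for_user are illustrative names, not the production code):

NUM_LOGICAL_SHARDS = 8

# before the split: two machines
LOGICAL_TO_PHYSICAL = {0: 'A', 1: 'A', 2: 'A', 3: 'A',
                       4: 'B', 5: 'B', 6: 'B', 7: 'B'}

# after the split, only this dict changes; user_id % 8 stays the same,
# so no rows ever need to be re-bucketed:
# LOGICAL_TO_PHYSICAL = {0: 'A', 1: 'A', 2: 'C', 3: 'C',
#                        4: 'B', 5: 'B', 6: 'D', 7: 'D'}

def machine_for_user(user_id):
    logical_shard = user_id % NUM_LOGICAL_SHARDS
    return logical_shard, LOGICAL_TO_PHYSICAL[logical_shard]

# e.g. machine_for_user(33) -> (1, 'A')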

¡schemas!

all that 'public' stuff I'd been glossing over for 2 years

- database
  - schema
    - table
      - columns

spun up a set of machines

using fabric, created thousands of schemas

machineA:
  shard0.photos_by_user
  shard1.photos_by_user
  shard2.photos_by_user
  shard3.photos_by_user

machineB:
  shard4.photos_by_user
  shard5.photos_by_user
  shard6.photos_by_user
  shard7.photos_by_user
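roughly the kind of DDL such a fabric task might run against each machine; psycopg2, the connection details, and the photos_by_user column list are assumptions, not the real schema:

import psycopg2

# shards 0-3 live on machineA in the layout above; run the same loop with
# range(4, 8) against machineB
conn = psycopg2.connect(host='machineA', dbname='shards')
conn.autocommit = True
cur = conn.cursor()

for shard in range(0, 4):
    cur.execute('CREATE SCHEMA shard%d' % shard)
    cur.execute(
        'CREATE TABLE shard%d.photos_by_user ('
        ' user_id bigint NOT NULL,'
        ' photo_id bigint NOT NULL,'
        ' created_at timestamp NOT NULL DEFAULT now())' % shard)

cur.close()
conn.close()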

(fabric or similar parallel task executor is essential)

application-side logic

SHARD_TO_DB = {}

SHARD_TO_DB[0] = 0
SHARD_TO_DB[1] = 0
SHARD_TO_DB[2] = 0
SHARD_TO_DB[3] = 0
SHARD_TO_DB[4] = 1
SHARD_TO_DB[5] = 1
SHARD_TO_DB[6] = 1
SHARD_TO_DB[7] = 1

instead of Django ORM, wrote a really simple db abstraction layer

select/update/insert/delete

select(fields, table_name, shard_key, where_statement, where_parameters)

select(fields, table_name, shard_key, where_statement, where_parameters)

shard_key % num_logical_shards = shard_id
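a sketch of how such a select() might resolve shard_key to a schema and a connection; the hosts, get_connection() helper, and 8-shard map (as on the earlier slide) are stand-ins, and the real layer surely does more:

import psycopg2

NUM_LOGICAL_SHARDS = 8  # thousands in practice; 8 keeps the example readable
SHARD_TO_DB = {0: 0, 1: 0, 2: 0, 3: 0, 4: 1, 5: 1, 6: 1, 7: 1}
DB_HOSTS = {0: 'machineA', 1: 'machineB'}  # placeholder connection info

def get_connection(db_number):
    return psycopg2.connect(host=DB_HOSTS[db_number], dbname='shards')

def select(fields, table_name, shard_key, where_statement, where_parameters):
    shard_id = shard_key % NUM_LOGICAL_SHARDS
    conn = get_connection(SHARD_TO_DB[shard_id])   # physical DB for this logical shard
    query = 'SELECT {f} FROM shard{s}.{t} WHERE {w}'.format(
        f=', '.join(fields), s=shard_id, t=table_name, w=where_statement)
    cur = conn.cursor()
    cur.execute(query, where_parameters)           # values still go through driver parameters
    return cur.fetchall()

# e.g. select(['photo_id', 'created_at'], 'photos_by_user', user_id,
#             'user_id = %s', [user_id])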

in most cases, user_id for us

custom Django test runner to create/tear down sharded DBs

most queries involve visiting a handful of shards over one or two machines

if mapping across shards on a single DB, UNION ALL to aggregate

clients to library pass in ((shard_key, id), (shard_key, id)), etc.

library maps sub-selects to each shard and each machine

parallel execution (per-machine, at least)

-> Append (cost=0.00..973.72 rows=100 width=12) (actual time=0.290..160.035 rows=30 loops=1)
  -> Limit (cost=0.00..806.24 rows=30 width=12) (actual time=0.288..159.913 rows=14 loops=1)
    -> Index Scan Backward using index on table (cost=0.00..18651.04 rows=694 width=12) (actual time=0.286..159.885 rows=14 loops=1)
  -> Limit (cost=0.00..71.15 rows=30 width=12) (actual time=0.015..0.018 rows=1 loops=1)
    -> Index Scan using index on table (cost=0.00..101.99 rows=43 width=12) (actual time=0.013..0.014 rows=1 loops=1)
(etc.)
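the plan above comes from a query shaped roughly like this sketch: one sub-select per logical shard living on the same machine, glued together with UNION ALL (the table, columns, and ORDER BY/LIMIT shape are assumptions inferred from the plan):

def union_all_across_shards(shard_ids, fields, table_name, where_statement, limit=30):
    # one sub-select per logical shard on this machine; Postgres runs the
    # combined query as a single Append node, like the plan above
    subselects = []
    for shard_id in shard_ids:
        subselects.append(
            '(SELECT {f} FROM shard{s}.{t} WHERE {w} ORDER BY id DESC LIMIT {n})'.format(
                f=', '.join(fields), s=shard_id, t=table_name, w=where_statement, n=limit))
    return '\nUNION ALL\n'.join(subselects)

# e.g. union_all_across_shards([0, 3], ['photo_id'], 'photos_by_user', 'user_id = %s')
# builds two sub-selects; when executing, repeat where_parameters once per sub-select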

eventually would be nice to parallelize across machines

next challenge: unique IDs

requirements:

1. should be time-sortable without requiring a lookup
2. should be 64-bit
3. low operational complexity

surveyed the options:

ticket servers

UUID

twitter snowflake

application-level IDs a la Mongo

hey, the db is already pretty good about incrementing sequences

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

CREATE OR REPLACE FUNCTION insta5.next_id(OUT result bigint) AS $$
DECLARE
    our_epoch bigint := 1314220021721;
    seq_id bigint;
    now_millis bigint;
    shard_id int := 5;
BEGIN
    SELECT nextval('insta5.table_id_seq') % 1024 INTO seq_id;

    SELECT FLOOR(EXTRACT(EPOCH FROM clock_timestamp()) * 1000) INTO now_millis;
    result := (now_millis - our_epoch) << 23;
    result := result | (shard_id << 10);
    result := result | (seq_id);
END;
$$ LANGUAGE PLPGSQL;
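for context, a sketch of wiring a per-shard table to that function so inserts get their IDs locally inside shard 5; the table name and columns are placeholders, and it assumes the insta5 schema and insta5.table_id_seq sequence already exist:

import psycopg2

conn = psycopg2.connect(host='machineA', dbname='shards')  # placeholder connection
cur = conn.cursor()

# the DEFAULT is the important part: each row gets a globally unique,
# time-sortable 64-bit id without any central ticket server
cur.execute("""
    CREATE TABLE insta5.our_table (
        id bigint NOT NULL DEFAULT insta5.next_id(),
        user_id bigint NOT NULL,
        data text,
        PRIMARY KEY (id)
    )""")
cur.execute("INSERT INTO insta5.our_table (user_id, data) VALUES (%s, %s)", (33, 'hello'))
conn.commit()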

pulling shard ID from ID:

shard_id = id ^ ((id >> 23) << 23)
timestamp = EPOCH + (id >> 23)
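the same unpacking as a quick sketch in Python; EPOCH_MS is the custom epoch from the function above, and shifting off the 10 sequence bits turns the slide's low-23-bit value into the shard number:

EPOCH_MS = 1314220021721

def decode_id(photo_id):
    millis = (photo_id >> 23) + EPOCH_MS          # 41 high bits: creation time in ms
    low23 = photo_id ^ ((photo_id >> 23) << 23)   # the slide's formula: low 23 bits
    shard_id = low23 >> 10                        # 13 bits: shard
    seq_id = low23 & 0x3FF                        # 10 bits: per-shard sequence
    return millis, shard_id, seq_id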

pros: guaranteed unique in 64 bits, not much of a CPU overhead

cons: large IDs from the get-go

hundreds of millions of IDs generated with this scheme, no issues

well, what about "re-sharding"?

first recourse: pg_reorg

rewrites tables in index order

only requires brief locks for atomic table renames

20+GB savings on some of our dbs

especially useful on EC2

but sometimes you just have to reshard

streaming replication to the rescue

(btw, repmgr is awesome)

repmgr standby clone <master>

machineA:
  shard0.photos_by_user
  shard1.photos_by_user
  shard2.photos_by_user
  shard3.photos_by_user

machineA':
  shard0.photos_by_user
  shard1.photos_by_user
  shard2.photos_by_user
  shard3.photos_by_user

machineA:
  shard0.photos_by_user
  shard1.photos_by_user
  shard2.photos_by_user
  shard3.photos_by_user

machineC:
  shard0.photos_by_user
  shard1.photos_by_user
  shard2.photos_by_user
  shard3.photos_by_user

PGBouncer abstracts moving DBs from the app logic

can do this as long as you have more logical shards than physical ones

beauty of schemas is that they are physically different files

(no I/O hit when deleting, no 'swiss cheese')

downside: requires ~30 seconds of maintenance to roll out a new schema mapping

(could be solved by having a concept of "read-only" mode for some DBs)

not great for range-scans that would span across shards

latest project: follow graph

v1: simple DB table (source_id, target_id, status)

who do I follow?
who follows me?
do I follow X?
does X follow me?
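a sketch of those four lookups against the v1 table; the followers table name and the 'active' status value are assumptions:

import psycopg2

conn = psycopg2.connect(host='machineA', dbname='shards')  # placeholder connection
cur = conn.cursor()

def following(user_id):          # who do I follow?
    cur.execute("SELECT target_id FROM followers "
                "WHERE source_id = %s AND status = 'active'", (user_id,))
    return [row[0] for row in cur.fetchall()]

def followed_by(user_id):        # who follows me?
    cur.execute("SELECT source_id FROM followers "
                "WHERE target_id = %s AND status = 'active'", (user_id,))
    return [row[0] for row in cur.fetchall()]

def follows(a, b):               # do I follow X? / does X follow me? (swap args)
    cur.execute("SELECT 1 FROM followers "
                "WHERE source_id = %s AND target_id = %s AND status = 'active'", (a, b))
    return cur.fetchone() is not None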

DB was busy, so we started storing a parallel version in Redis

follow_all(300 item list)

inconsistency

extra logic

so much extra logic

exposing your support team to the idea of cache invalidation

redesign took a page from twitter's book

PG can handle tens of thousands of requests, very light memcached caching

next steps

isolating services to minimize open conns

investigate physical hardware, etc., to reduce the need to re-shard

Wrap up

you don't need to give up PG's durability & features to shard

continue to let the DB do what the DB is great at

"don't shard until you have to"

(but don't over-estimate how hard it will be either)

scaled within constraints of the cloud

PG success story

(we're really excited about 9.2)

thanks! any questions?


3 low operational complexity

Monday April 23 12

surveyed the options

Monday April 23 12

ticket servers

Monday April 23 12

UUID

Monday April 23 12

twitter snowflake

Monday April 23 12

application-level IDs ala Mongo

Monday April 23 12

hey the db is already pretty good about

incrementing sequences

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

CREATE OR REPLACE FUNCTION insta5next_id(OUT result bigint) AS $$DECLARE our_epoch bigint = 1314220021721 seq_id bigint now_millis bigint shard_id int = 5BEGIN SELECT nextval(insta5table_id_seq) 1024 INTO seq_id

SELECT FLOOR(EXTRACT(EPOCH FROM clock_timestamp()) 1000) INTO now_millis result = (now_millis - our_epoch) ltlt 23 result = result | (shard_id ltlt 10) result = result | (seq_id)END$$ LANGUAGE PLPGSQL

Monday April 23 12

pulling shard ID from ID

shard_id = id ^ ((id gtgt 23) ltlt 23)timestamp = EPOCH + id gtgt 23

Monday April 23 12

pros guaranteed unique in 64-bits not much of a

CPU overhead

Monday April 23 12

cons large IDs from the get-go

Monday April 23 12

hundreds of millions of IDs generated with this

scheme no issues

Monday April 23 12

well what about ldquore-shardingrdquo

Monday April 23 12

first recourse pg_reorg

Monday April 23 12

rewrites tables in index order

Monday April 23 12

only requires brief locks for atomic table renames

Monday April 23 12

20+GB savings on some of our dbs

Monday April 23 12

Monday April 23 12

especially useful on EC2

Monday April 23 12

but sometimes you just have to reshard

Monday April 23 12

streaming replication to the rescue

Monday April 23 12

(btw repmgr is awesome)

Monday April 23 12

repmgr standby clone ltmastergt

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineArsquo shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineC shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

Monday April 23 12

PGBouncer abstracts moving DBs from the

app logic

Monday April 23 12

can do this as long as you have more logical shards than physical

ones

Monday April 23 12

beauty of schemas is that they are physically

different files

Monday April 23 12

(no IO hit when deleting no lsquoswiss cheesersquo)

Monday April 23 12

downside requires ~30 seconds of maintenance to roll out new schema

mapping

Monday April 23 12

(could be solved by having concept of ldquoread-

onlyrdquo mode for some DBs)

Monday April 23 12

not great for range-scans that would span across

shards

Monday April 23 12

latest project follow graph

Monday April 23 12

v1 simple DB table(source_id target_id

status)

Monday April 23 12

who do I followwho follows me

do I follow Xdoes X follow me

Monday April 23 12

DB was busy so we started storing parallel

version in Redis

Monday April 23 12

follow_all(300 item list)

Monday April 23 12

inconsistency

Monday April 23 12

extra logic

Monday April 23 12

so much extra logic

Monday April 23 12

exposing your support team to the idea of cache invalidation

Monday April 23 12

Monday April 23 12

redesign took a page from twitterrsquos book

Monday April 23 12

PG can handle tens of thousands of requests very light memcached

caching

Monday April 23 12

next steps

Monday April 23 12

isolating services to minimize open conns

Monday April 23 12

investigate physical hardware etc to reduce

need to re-shard

Monday April 23 12

Wrap up

Monday April 23 12

you donrsquot need to give up PGrsquos durability amp features to shard

Monday April 23 12

continue to let the DB do what the DB is great at

Monday April 23 12

ldquodonrsquot shard until you have tordquo

Monday April 23 12

(but donrsquot over-estimate how hard it will be either)

Monday April 23 12

scaled within constraints of the cloud

Monday April 23 12

PG success story

Monday April 23 12

(wersquore really excited about 92)

Monday April 23 12

thanks any qs

Monday April 23 12

ORM made it too easy to not really think through

primary keys

Monday April 23 12

pretty good for getting off the ground

Monday April 23 12

Mediaobjectsget(pk = 4)

Monday April 23 12

first version of our feed (pre-launch)

Monday April 23 12

friends = Relationshipsobjectsfilter(source_user = user)

recent_photos = Mediaobjectsfilter(user_id__in = friends)order_by(lsquo-pkrsquo)[020]

Monday April 23 12

main feed at launch

Monday April 23 12

Redis user 33 postsfriends = SMEMBERS followers33for user in friends LPUSH feedltuser_idgt ltmedia_idgt

for readingLRANGE feed4 0 20

Monday April 23 12

canonical data PGfeedslistssets Redis

object cache memcache

Monday April 23 12

post-launch

Monday April 23 12

moved db to its own machine

Monday April 23 12

at time largest table photo metadata

Monday April 23 12

ran master-slave from the beginning with

streaming replication

Monday April 23 12

backups stop the replica xfs_freeze drives and take EBS snapshot

Monday April 23 12

AWS tradeoff

Monday April 23 12

3 early problems we hit with PG

Monday April 23 12

1 oh that setting was the problem

Monday April 23 12

work_mem

Monday April 23 12

shared_buffers

Monday April 23 12

cost_delay

Monday April 23 12

2 Django-specific ltidle in transactiongt

Monday April 23 12

3 connection pooling

Monday April 23 12

(we use PGBouncer)

Monday April 23 12

somewhere in this crazy couple of months

Christophe to the rescue

Monday April 23 12

photos kept growing and growing

Monday April 23 12

and only 68GB of RAM on biggest machine in EC2

Monday April 23 12

so what now

Monday April 23 12

Phase 2 Vertical Partitioning

Monday April 23 12

django db routers make it pretty easy

Monday April 23 12

def db_for_read(self model) if app_label == photos return photodb

Monday April 23 12

once you untangle all your foreign key

relationships

Monday April 23 12

(all of those useruser_id interchangeable calls bite

you now)

Monday April 23 12

plenty of time spent in PGFouine

Monday April 23 12

read slaves (using streaming replicas) where

we need to reduce contention

Monday April 23 12

a few months later

Monday April 23 12

photosdb gt 60GB

Monday April 23 12

precipitated by being on cloud hardware but likely to have hit limit eventually

either way

Monday April 23 12

what now

Monday April 23 12

horizontal partitioning

Monday April 23 12

Phase 3 sharding

Monday April 23 12

ldquosurely wersquoll have hired someone experienced before we actually need

to shardrdquo

Monday April 23 12

never true about scaling

Monday April 23 12

1 choosing a method2 adapting the application

3 expanding capacity

Monday April 23 12

evaluated solutions

Monday April 23 12

at the time none were up to task of being our

primary DB

Monday April 23 12

NoSQL alternatives

Monday April 23 12

Skypersquos sharding proxy

Monday April 23 12

Rangedate-based partitioning

Monday April 23 12

did in Postgres itself

Monday April 23 12

requirements

Monday April 23 12

1 low operational amp code complexity

Monday April 23 12

2 easy expanding of capacity

Monday April 23 12

3 low performance impact on application

Monday April 23 12

schema-based logical sharding

Monday April 23 12

many many many (thousands) of logical

shards

Monday April 23 12

that map to fewer physical ones

Monday April 23 12

8 logical shards on 2 machines

user_id 8 = logical shard

logical shards -gt physical shard map

0 A 1 A 2 A 3 A 4 B 5 B 6 B 7 B

Monday April 23 12

8 logical shards on 2 4 machines

user_id 8 = logical shard

logical shards -gt physical shard map

0 A 1 A 2 C 3 C 4 B 5 B 6 D 7 D

Monday April 23 12

iexclschemas

Monday April 23 12

all that lsquopublicrsquo stuff Irsquod been glossing over for 2

years

Monday April 23 12

- database - schema - table - columns

Monday April 23 12

spun up set of machines

Monday April 23 12

using fabric created thousands of schemas

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineB shard4 photos_by_user shard5 photos_by_user shard6 photos_by_user shard7 photos_by_user

Monday April 23 12

(fabric or similar parallel task executor is

essential)

Monday April 23 12

application-side logic

Monday April 23 12

SHARD_TO_DB =

SHARD_TO_DB[0] = 0SHARD_TO_DB[1] = 0SHARD_TO_DB[2] = 0SHARD_TO_DB[3] = 0SHARD_TO_DB[4] = 1SHARD_TO_DB[5] = 1SHARD_TO_DB[6] = 1SHARD_TO_DB[7] = 1

Monday April 23 12

instead of Django ORM wrote really simple db

abstraction layer

Monday April 23 12

selectupdateinsertdelete

Monday April 23 12

select(fields table_name shard_key where_statement where_parameters)

Monday April 23 12

select(fields table_name shard_key where_statement where_parameters)

shard_key num_logical_shards = shard_id

Monday April 23 12

in most cases user_id for us

Monday April 23 12

custom Django test runner to createtear-down sharded DBs

Monday April 23 12

most queries involve visiting handful of shards

over one or two machines

Monday April 23 12

if mapping across shards on single DB UNION

ALL to aggregate

Monday April 23 12

clients to library pass in((shard_key id)

(shard_keyid)) etc

Monday April 23 12

library maps sub-selects to each shard and each

machine

Monday April 23 12

parallel execution (per-machine at least)

Monday April 23 12

-gt Append (cost=00097372 rows=100 width=12) (actual time=0290160035 rows=30 loops=1) -gt Limit (cost=00080624 rows=30 width=12) (actual time=0288159913 rows=14 loops=1) -gt Index Scan Backward using index on table (cost=0001865104 rows=694 width=12) (actual time=0286159885 rows=14 loops=1) -gt Limit (cost=0007115 rows=30 width=12) (actual time=00150018 rows=1 loops=1) -gt Index Scan using index on table (cost=00010199 rows=43 width=12) (actual time=00130014 rows=1 loops=1) (etc)

Monday April 23 12

eventually would be nice to parallelize across

machines

Monday April 23 12

next challenge unique IDs

Monday April 23 12

requirements

Monday April 23 12

1 should be time sortable without requiring

a lookup

Monday April 23 12

2 should be 64-bit

Monday April 23 12

3 low operational complexity

Monday April 23 12

surveyed the options

Monday April 23 12

ticket servers

Monday April 23 12

UUID

Monday April 23 12

twitter snowflake

Monday April 23 12

application-level IDs ala Mongo

Monday April 23 12

hey the db is already pretty good about

incrementing sequences

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

CREATE OR REPLACE FUNCTION insta5next_id(OUT result bigint) AS $$DECLARE our_epoch bigint = 1314220021721 seq_id bigint now_millis bigint shard_id int = 5BEGIN SELECT nextval(insta5table_id_seq) 1024 INTO seq_id

SELECT FLOOR(EXTRACT(EPOCH FROM clock_timestamp()) 1000) INTO now_millis result = (now_millis - our_epoch) ltlt 23 result = result | (shard_id ltlt 10) result = result | (seq_id)END$$ LANGUAGE PLPGSQL

Monday April 23 12

pulling shard ID from ID

shard_id = id ^ ((id gtgt 23) ltlt 23)timestamp = EPOCH + id gtgt 23

Monday April 23 12

pros guaranteed unique in 64-bits not much of a

CPU overhead

Monday April 23 12

cons large IDs from the get-go

Monday April 23 12

hundreds of millions of IDs generated with this

scheme no issues

Monday April 23 12

well what about ldquore-shardingrdquo

Monday April 23 12

first recourse pg_reorg

Monday April 23 12

rewrites tables in index order

Monday April 23 12

only requires brief locks for atomic table renames

Monday April 23 12

20+GB savings on some of our dbs

Monday April 23 12

Monday April 23 12

especially useful on EC2

Monday April 23 12

but sometimes you just have to reshard

Monday April 23 12

streaming replication to the rescue

Monday April 23 12

(btw repmgr is awesome)

Monday April 23 12

repmgr standby clone ltmastergt

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineArsquo shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineC shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

Monday April 23 12

PGBouncer abstracts moving DBs from the

app logic

Monday April 23 12

can do this as long as you have more logical shards than physical

ones

Monday April 23 12

beauty of schemas is that they are physically

different files

Monday April 23 12

(no IO hit when deleting no lsquoswiss cheesersquo)

Monday April 23 12

downside requires ~30 seconds of maintenance to roll out new schema

mapping

Monday April 23 12

(could be solved by having concept of ldquoread-

onlyrdquo mode for some DBs)

Monday April 23 12

not great for range-scans that would span across

shards

Monday April 23 12

latest project follow graph

Monday April 23 12

v1 simple DB table(source_id target_id

status)

Monday April 23 12

who do I followwho follows me

do I follow Xdoes X follow me

Monday April 23 12

DB was busy so we started storing parallel

version in Redis

Monday April 23 12

follow_all(300 item list)

Monday April 23 12

inconsistency

Monday April 23 12

extra logic

Monday April 23 12

so much extra logic

Monday April 23 12

exposing your support team to the idea of cache invalidation

Monday April 23 12

Monday April 23 12

redesign took a page from twitterrsquos book

Monday April 23 12

PG can handle tens of thousands of requests very light memcached

caching

Monday April 23 12

next steps

Monday April 23 12

isolating services to minimize open conns

Monday April 23 12

investigate physical hardware etc to reduce

need to re-shard

Monday April 23 12

Wrap up

Monday April 23 12

you donrsquot need to give up PGrsquos durability amp features to shard

Monday April 23 12

continue to let the DB do what the DB is great at

Monday April 23 12

ldquodonrsquot shard until you have tordquo

Monday April 23 12

(but donrsquot over-estimate how hard it will be either)

Monday April 23 12

scaled within constraints of the cloud

Monday April 23 12

PG success story

Monday April 23 12

(wersquore really excited about 92)

Monday April 23 12

thanks any qs

Monday April 23 12

pretty good for getting off the ground

Monday April 23 12

Mediaobjectsget(pk = 4)

Monday April 23 12

first version of our feed (pre-launch)

Monday April 23 12

friends = Relationshipsobjectsfilter(source_user = user)

recent_photos = Mediaobjectsfilter(user_id__in = friends)order_by(lsquo-pkrsquo)[020]

Monday April 23 12

main feed at launch

Monday April 23 12

Redis user 33 postsfriends = SMEMBERS followers33for user in friends LPUSH feedltuser_idgt ltmedia_idgt

for readingLRANGE feed4 0 20

Monday April 23 12

canonical data PGfeedslistssets Redis

object cache memcache

Monday April 23 12

post-launch

Monday April 23 12

moved db to its own machine

Monday April 23 12

at time largest table photo metadata

Monday April 23 12

ran master-slave from the beginning with

streaming replication

Monday April 23 12

backups stop the replica xfs_freeze drives and take EBS snapshot

Monday April 23 12

AWS tradeoff

Monday April 23 12

3 early problems we hit with PG

Monday April 23 12

1 oh that setting was the problem

Monday April 23 12

work_mem

Monday April 23 12

shared_buffers

Monday April 23 12

cost_delay

Monday April 23 12

2 Django-specific ltidle in transactiongt

Monday April 23 12

3 connection pooling

Monday April 23 12

(we use PGBouncer)

Monday April 23 12

somewhere in this crazy couple of months

Christophe to the rescue

Monday April 23 12

photos kept growing and growing

Monday April 23 12

and only 68GB of RAM on biggest machine in EC2

Monday April 23 12

so what now

Monday April 23 12

Phase 2 Vertical Partitioning

Monday April 23 12

django db routers make it pretty easy

Monday April 23 12

def db_for_read(self model) if app_label == photos return photodb

Monday April 23 12

once you untangle all your foreign key

relationships

Monday April 23 12

(all of those useruser_id interchangeable calls bite

you now)

Monday April 23 12

plenty of time spent in PGFouine

Monday April 23 12

read slaves (using streaming replicas) where

we need to reduce contention

Monday April 23 12

a few months later

Monday April 23 12

photosdb gt 60GB

Monday April 23 12

precipitated by being on cloud hardware but likely to have hit limit eventually

either way

Monday April 23 12

what now

Monday April 23 12

horizontal partitioning

Monday April 23 12

Phase 3 sharding

Monday April 23 12

ldquosurely wersquoll have hired someone experienced before we actually need

to shardrdquo

Monday April 23 12

never true about scaling

Monday April 23 12

1 choosing a method2 adapting the application

3 expanding capacity

Monday April 23 12

evaluated solutions

Monday April 23 12

at the time none were up to task of being our

primary DB

Monday April 23 12

NoSQL alternatives

Monday April 23 12

Skypersquos sharding proxy

Monday April 23 12

Rangedate-based partitioning

Monday April 23 12

did in Postgres itself

Monday April 23 12

requirements

Monday April 23 12

1 low operational amp code complexity

Monday April 23 12

2 easy expanding of capacity

Monday April 23 12

3 low performance impact on application

Monday April 23 12

schema-based logical sharding

Monday April 23 12

many many many (thousands) of logical

shards

Monday April 23 12

that map to fewer physical ones

Monday April 23 12

8 logical shards on 2 machines

user_id 8 = logical shard

logical shards -gt physical shard map

0 A 1 A 2 A 3 A 4 B 5 B 6 B 7 B

Monday April 23 12

8 logical shards on 2 4 machines

user_id 8 = logical shard

logical shards -gt physical shard map

0 A 1 A 2 C 3 C 4 B 5 B 6 D 7 D

Monday April 23 12

iexclschemas

Monday April 23 12

all that lsquopublicrsquo stuff Irsquod been glossing over for 2

years

Monday April 23 12

- database - schema - table - columns

Monday April 23 12

spun up set of machines

Monday April 23 12

using fabric created thousands of schemas

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineB shard4 photos_by_user shard5 photos_by_user shard6 photos_by_user shard7 photos_by_user

Monday April 23 12

(fabric or similar parallel task executor is

essential)

Monday April 23 12

application-side logic

Monday April 23 12

SHARD_TO_DB =

SHARD_TO_DB[0] = 0SHARD_TO_DB[1] = 0SHARD_TO_DB[2] = 0SHARD_TO_DB[3] = 0SHARD_TO_DB[4] = 1SHARD_TO_DB[5] = 1SHARD_TO_DB[6] = 1SHARD_TO_DB[7] = 1

Monday April 23 12

instead of Django ORM wrote really simple db

abstraction layer

Monday April 23 12

selectupdateinsertdelete

Monday April 23 12

select(fields table_name shard_key where_statement where_parameters)

Monday April 23 12

select(fields table_name shard_key where_statement where_parameters)

shard_key num_logical_shards = shard_id

Monday April 23 12

in most cases user_id for us

Monday April 23 12

custom Django test runner to createtear-down sharded DBs

Monday April 23 12

most queries involve visiting handful of shards

over one or two machines

Monday April 23 12

if mapping across shards on single DB UNION

ALL to aggregate

Monday April 23 12

clients to library pass in((shard_key id)

(shard_keyid)) etc

Monday April 23 12

library maps sub-selects to each shard and each

machine

Monday April 23 12

parallel execution (per-machine at least)

Monday April 23 12

-gt Append (cost=00097372 rows=100 width=12) (actual time=0290160035 rows=30 loops=1) -gt Limit (cost=00080624 rows=30 width=12) (actual time=0288159913 rows=14 loops=1) -gt Index Scan Backward using index on table (cost=0001865104 rows=694 width=12) (actual time=0286159885 rows=14 loops=1) -gt Limit (cost=0007115 rows=30 width=12) (actual time=00150018 rows=1 loops=1) -gt Index Scan using index on table (cost=00010199 rows=43 width=12) (actual time=00130014 rows=1 loops=1) (etc)

Monday April 23 12

eventually would be nice to parallelize across

machines

Monday April 23 12

next challenge unique IDs

Monday April 23 12

requirements

Monday April 23 12

1 should be time sortable without requiring

a lookup

Monday April 23 12

2 should be 64-bit

Monday April 23 12

3 low operational complexity

Monday April 23 12

surveyed the options

Monday April 23 12

ticket servers

Monday April 23 12

UUID

Monday April 23 12

twitter snowflake

Monday April 23 12

application-level IDs ala Mongo

Monday April 23 12

hey the db is already pretty good about

incrementing sequences

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

CREATE OR REPLACE FUNCTION insta5next_id(OUT result bigint) AS $$DECLARE our_epoch bigint = 1314220021721 seq_id bigint now_millis bigint shard_id int = 5BEGIN SELECT nextval(insta5table_id_seq) 1024 INTO seq_id

SELECT FLOOR(EXTRACT(EPOCH FROM clock_timestamp()) 1000) INTO now_millis result = (now_millis - our_epoch) ltlt 23 result = result | (shard_id ltlt 10) result = result | (seq_id)END$$ LANGUAGE PLPGSQL

Monday April 23 12

pulling shard ID from ID

shard_id = id ^ ((id gtgt 23) ltlt 23)timestamp = EPOCH + id gtgt 23

Monday April 23 12

pros guaranteed unique in 64-bits not much of a

CPU overhead

Monday April 23 12

cons large IDs from the get-go

Monday April 23 12

hundreds of millions of IDs generated with this

scheme no issues

Monday April 23 12

well what about ldquore-shardingrdquo

Monday April 23 12

first recourse pg_reorg

Monday April 23 12

rewrites tables in index order

Monday April 23 12

only requires brief locks for atomic table renames

Monday April 23 12

20+GB savings on some of our dbs

Monday April 23 12

Monday April 23 12

especially useful on EC2

Monday April 23 12

but sometimes you just have to reshard

Monday April 23 12

streaming replication to the rescue

Monday April 23 12

(btw repmgr is awesome)

Monday April 23 12

repmgr standby clone ltmastergt

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineArsquo shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineC shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

Monday April 23 12

PGBouncer abstracts moving DBs from the

app logic

Monday April 23 12

can do this as long as you have more logical shards than physical

ones

Monday April 23 12

beauty of schemas is that they are physically

different files

Monday April 23 12

(no IO hit when deleting no lsquoswiss cheesersquo)

Monday April 23 12

downside requires ~30 seconds of maintenance to roll out new schema

mapping

Monday April 23 12

(could be solved by having concept of ldquoread-

onlyrdquo mode for some DBs)

Monday April 23 12

not great for range-scans that would span across

shards

Monday April 23 12

latest project follow graph

Monday April 23 12

v1 simple DB table(source_id target_id

status)

Monday April 23 12

who do I followwho follows me

do I follow Xdoes X follow me

Monday April 23 12

DB was busy so we started storing parallel

version in Redis

Monday April 23 12

follow_all(300 item list)

Monday April 23 12

inconsistency

Monday April 23 12

extra logic

Monday April 23 12

so much extra logic

Monday April 23 12

exposing your support team to the idea of cache invalidation

Monday April 23 12

Monday April 23 12

redesign took a page from twitterrsquos book

Monday April 23 12

PG can handle tens of thousands of requests very light memcached

caching

Monday April 23 12

next steps

Monday April 23 12

isolating services to minimize open conns

Monday April 23 12

investigate physical hardware etc to reduce

need to re-shard

Monday April 23 12

Wrap up

Monday April 23 12

you donrsquot need to give up PGrsquos durability amp features to shard

Monday April 23 12

continue to let the DB do what the DB is great at

Monday April 23 12

ldquodonrsquot shard until you have tordquo

Monday April 23 12

(but donrsquot over-estimate how hard it will be either)

Monday April 23 12

scaled within constraints of the cloud

Monday April 23 12

PG success story

Monday April 23 12

(wersquore really excited about 92)

Monday April 23 12

thanks any qs

Monday April 23 12

Mediaobjectsget(pk = 4)

Monday April 23 12

first version of our feed (pre-launch)

Monday April 23 12

friends = Relationshipsobjectsfilter(source_user = user)

recent_photos = Mediaobjectsfilter(user_id__in = friends)order_by(lsquo-pkrsquo)[020]

Monday April 23 12

main feed at launch

Monday April 23 12

Redis user 33 postsfriends = SMEMBERS followers33for user in friends LPUSH feedltuser_idgt ltmedia_idgt

for readingLRANGE feed4 0 20

Monday April 23 12

canonical data PGfeedslistssets Redis

object cache memcache

Monday April 23 12

post-launch

Monday April 23 12

moved db to its own machine

Monday April 23 12

at time largest table photo metadata

Monday April 23 12

ran master-slave from the beginning with

streaming replication

Monday April 23 12

backups stop the replica xfs_freeze drives and take EBS snapshot

Monday April 23 12

AWS tradeoff

Monday April 23 12

3 early problems we hit with PG

Monday April 23 12

1 oh that setting was the problem

Monday April 23 12

work_mem

Monday April 23 12

shared_buffers

Monday April 23 12

cost_delay

Monday April 23 12

2 Django-specific ltidle in transactiongt

Monday April 23 12

3 connection pooling

Monday April 23 12

(we use PGBouncer)

Monday April 23 12

somewhere in this crazy couple of months

Christophe to the rescue

Monday April 23 12

photos kept growing and growing

Monday April 23 12

and only 68GB of RAM on biggest machine in EC2

Monday April 23 12

so what now

Monday April 23 12

Phase 2 Vertical Partitioning

Monday April 23 12

django db routers make it pretty easy

Monday April 23 12

def db_for_read(self model) if app_label == photos return photodb

Monday April 23 12

once you untangle all your foreign key

relationships

Monday April 23 12

(all of those useruser_id interchangeable calls bite

you now)

Monday April 23 12

plenty of time spent in PGFouine

Monday April 23 12

read slaves (using streaming replicas) where

we need to reduce contention

Monday April 23 12

a few months later

Monday April 23 12

photosdb gt 60GB

Monday April 23 12

precipitated by being on cloud hardware but likely to have hit limit eventually

either way

Monday April 23 12

what now

Monday April 23 12

horizontal partitioning

Monday April 23 12

Phase 3 sharding

Monday April 23 12

ldquosurely wersquoll have hired someone experienced before we actually need

to shardrdquo

Monday April 23 12

never true about scaling

Monday April 23 12

1 choosing a method2 adapting the application

3 expanding capacity

Monday April 23 12

evaluated solutions

Monday April 23 12

at the time none were up to task of being our

primary DB

Monday April 23 12

NoSQL alternatives

Monday April 23 12

Skypersquos sharding proxy

Monday April 23 12

Rangedate-based partitioning

Monday April 23 12

did in Postgres itself

Monday April 23 12

requirements

Monday April 23 12

1 low operational amp code complexity

Monday April 23 12

2 easy expanding of capacity

Monday April 23 12

3 low performance impact on application

Monday April 23 12

schema-based logical sharding

Monday April 23 12

many many many (thousands) of logical

shards

Monday April 23 12

that map to fewer physical ones

Monday April 23 12

8 logical shards on 2 machines

user_id 8 = logical shard

logical shards -gt physical shard map

0 A 1 A 2 A 3 A 4 B 5 B 6 B 7 B

Monday April 23 12

8 logical shards on 2 4 machines

user_id 8 = logical shard

logical shards -gt physical shard map

0 A 1 A 2 C 3 C 4 B 5 B 6 D 7 D

Monday April 23 12

iexclschemas

Monday April 23 12

all that lsquopublicrsquo stuff Irsquod been glossing over for 2

years

Monday April 23 12

- database - schema - table - columns

Monday April 23 12

spun up set of machines

Monday April 23 12

using fabric created thousands of schemas

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineB shard4 photos_by_user shard5 photos_by_user shard6 photos_by_user shard7 photos_by_user

Monday April 23 12

(fabric or similar parallel task executor is

essential)

Monday April 23 12

application-side logic

Monday April 23 12

SHARD_TO_DB =

SHARD_TO_DB[0] = 0SHARD_TO_DB[1] = 0SHARD_TO_DB[2] = 0SHARD_TO_DB[3] = 0SHARD_TO_DB[4] = 1SHARD_TO_DB[5] = 1SHARD_TO_DB[6] = 1SHARD_TO_DB[7] = 1

Monday April 23 12

instead of Django ORM wrote really simple db

abstraction layer

Monday April 23 12

selectupdateinsertdelete

Monday April 23 12

select(fields table_name shard_key where_statement where_parameters)

Monday April 23 12

select(fields table_name shard_key where_statement where_parameters)

shard_key num_logical_shards = shard_id

Monday April 23 12

in most cases user_id for us

Monday April 23 12

custom Django test runner to createtear-down sharded DBs

Monday April 23 12

most queries involve visiting handful of shards

over one or two machines

Monday April 23 12

if mapping across shards on single DB UNION

ALL to aggregate

Monday April 23 12

clients to library pass in((shard_key id)

(shard_keyid)) etc

Monday April 23 12

library maps sub-selects to each shard and each

machine

Monday April 23 12

parallel execution (per-machine at least)

Monday April 23 12

-gt Append (cost=00097372 rows=100 width=12) (actual time=0290160035 rows=30 loops=1) -gt Limit (cost=00080624 rows=30 width=12) (actual time=0288159913 rows=14 loops=1) -gt Index Scan Backward using index on table (cost=0001865104 rows=694 width=12) (actual time=0286159885 rows=14 loops=1) -gt Limit (cost=0007115 rows=30 width=12) (actual time=00150018 rows=1 loops=1) -gt Index Scan using index on table (cost=00010199 rows=43 width=12) (actual time=00130014 rows=1 loops=1) (etc)

Monday April 23 12

eventually would be nice to parallelize across

machines

Monday April 23 12

next challenge unique IDs

Monday April 23 12

requirements

Monday April 23 12

1 should be time sortable without requiring

a lookup

Monday April 23 12

2 should be 64-bit

Monday April 23 12

3 low operational complexity

Monday April 23 12

surveyed the options

Monday April 23 12

ticket servers

Monday April 23 12

UUID

Monday April 23 12

twitter snowflake

Monday April 23 12

application-level IDs ala Mongo

Monday April 23 12

hey the db is already pretty good about

incrementing sequences

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

CREATE OR REPLACE FUNCTION insta5next_id(OUT result bigint) AS $$DECLARE our_epoch bigint = 1314220021721 seq_id bigint now_millis bigint shard_id int = 5BEGIN SELECT nextval(insta5table_id_seq) 1024 INTO seq_id

SELECT FLOOR(EXTRACT(EPOCH FROM clock_timestamp()) 1000) INTO now_millis result = (now_millis - our_epoch) ltlt 23 result = result | (shard_id ltlt 10) result = result | (seq_id)END$$ LANGUAGE PLPGSQL

Monday April 23 12

pulling shard ID from ID

shard_id = id ^ ((id gtgt 23) ltlt 23)timestamp = EPOCH + id gtgt 23

Monday April 23 12

pros guaranteed unique in 64-bits not much of a

CPU overhead

Monday April 23 12

cons large IDs from the get-go

Monday April 23 12

hundreds of millions of IDs generated with this

scheme no issues

Monday April 23 12

well what about ldquore-shardingrdquo

Monday April 23 12

first recourse pg_reorg

Monday April 23 12

rewrites tables in index order

Monday April 23 12

only requires brief locks for atomic table renames

Monday April 23 12

20+GB savings on some of our dbs

Monday April 23 12

Monday April 23 12

especially useful on EC2

Monday April 23 12

but sometimes you just have to reshard

Monday April 23 12

streaming replication to the rescue

Monday April 23 12

(btw repmgr is awesome)

Monday April 23 12

repmgr standby clone ltmastergt

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineArsquo shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineC shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

Monday April 23 12

PGBouncer abstracts moving DBs from the

app logic

Monday April 23 12

can do this as long as you have more logical shards than physical

ones

Monday April 23 12

beauty of schemas is that they are physically

different files

Monday April 23 12

(no IO hit when deleting no lsquoswiss cheesersquo)

Monday April 23 12

downside requires ~30 seconds of maintenance to roll out new schema

mapping

Monday April 23 12

(could be solved by having concept of ldquoread-

onlyrdquo mode for some DBs)

Monday April 23 12

not great for range-scans that would span across

shards

Monday April 23 12

latest project follow graph

Monday April 23 12

v1 simple DB table(source_id target_id

status)

Monday April 23 12

who do I followwho follows me

do I follow Xdoes X follow me

Monday April 23 12

DB was busy so we started storing parallel

version in Redis

Monday April 23 12

follow_all(300 item list)

Monday April 23 12

inconsistency

Monday April 23 12

extra logic

Monday April 23 12

so much extra logic

Monday April 23 12

exposing your support team to the idea of cache invalidation

Monday April 23 12

Monday April 23 12

redesign took a page from twitterrsquos book

Monday April 23 12

PG can handle tens of thousands of requests very light memcached

caching

Monday April 23 12

next steps

Monday April 23 12

isolating services to minimize open conns

Monday April 23 12

investigate physical hardware etc to reduce

need to re-shard

Monday April 23 12

Wrap up

Monday April 23 12

you donrsquot need to give up PGrsquos durability amp features to shard

Monday April 23 12

continue to let the DB do what the DB is great at

Monday April 23 12

ldquodonrsquot shard until you have tordquo

Monday April 23 12

(but donrsquot over-estimate how hard it will be either)

Monday April 23 12

scaled within constraints of the cloud

Monday April 23 12

PG success story

Monday April 23 12

(wersquore really excited about 92)

Monday April 23 12

thanks any qs

Monday April 23 12

first version of our feed (pre-launch)

Monday April 23 12

friends = Relationshipsobjectsfilter(source_user = user)

recent_photos = Mediaobjectsfilter(user_id__in = friends)order_by(lsquo-pkrsquo)[020]

Monday April 23 12

main feed at launch

Monday April 23 12

Redis user 33 postsfriends = SMEMBERS followers33for user in friends LPUSH feedltuser_idgt ltmedia_idgt

for readingLRANGE feed4 0 20

Monday April 23 12

canonical data PGfeedslistssets Redis

object cache memcache

Monday April 23 12

post-launch

Monday April 23 12

moved db to its own machine

Monday April 23 12

at time largest table photo metadata

Monday April 23 12

ran master-slave from the beginning with

streaming replication

Monday April 23 12

backups stop the replica xfs_freeze drives and take EBS snapshot

Monday April 23 12

AWS tradeoff

Monday April 23 12

3 early problems we hit with PG

Monday April 23 12

1 oh that setting was the problem

Monday April 23 12

work_mem

Monday April 23 12

shared_buffers

Monday April 23 12

cost_delay

Monday April 23 12

2 Django-specific ltidle in transactiongt

Monday April 23 12

3 connection pooling

Monday April 23 12

(we use PGBouncer)

Monday April 23 12

somewhere in this crazy couple of months

Christophe to the rescue

Monday April 23 12

photos kept growing and growing

Monday April 23 12

and only 68GB of RAM on biggest machine in EC2

Monday April 23 12

so what now

Monday April 23 12

Phase 2 Vertical Partitioning

Monday April 23 12

django db routers make it pretty easy

Monday April 23 12

def db_for_read(self model) if app_label == photos return photodb

Monday April 23 12

once you untangle all your foreign key

relationships

Monday April 23 12

(all of those useruser_id interchangeable calls bite

you now)

Monday April 23 12

plenty of time spent in PGFouine

Monday April 23 12

read slaves (using streaming replicas) where

we need to reduce contention

Monday April 23 12

a few months later

Monday April 23 12

photosdb gt 60GB

Monday April 23 12

precipitated by being on cloud hardware but likely to have hit limit eventually

either way

Monday April 23 12

what now

Monday April 23 12

horizontal partitioning

Monday April 23 12

Phase 3 sharding

Monday April 23 12

ldquosurely wersquoll have hired someone experienced before we actually need

to shardrdquo

Monday April 23 12

never true about scaling

Monday April 23 12

1 choosing a method2 adapting the application

3 expanding capacity

Monday April 23 12

evaluated solutions

Monday April 23 12

at the time none were up to task of being our

primary DB

Monday April 23 12

NoSQL alternatives

Monday April 23 12

Skypersquos sharding proxy

Monday April 23 12

Rangedate-based partitioning

Monday April 23 12

did in Postgres itself

Monday April 23 12

requirements

Monday April 23 12

1 low operational amp code complexity

Monday April 23 12

2 easy expanding of capacity

Monday April 23 12

3 low performance impact on application

Monday April 23 12

schema-based logical sharding

Monday April 23 12

many many many (thousands) of logical

shards

Monday April 23 12

that map to fewer physical ones

Monday April 23 12

8 logical shards on 2 machines

user_id 8 = logical shard

logical shards -gt physical shard map

0 A 1 A 2 A 3 A 4 B 5 B 6 B 7 B

Monday April 23 12

8 logical shards on 2 4 machines

user_id 8 = logical shard

logical shards -gt physical shard map

0 A 1 A 2 C 3 C 4 B 5 B 6 D 7 D

Monday April 23 12

iexclschemas

Monday April 23 12

all that lsquopublicrsquo stuff Irsquod been glossing over for 2

years

Monday April 23 12

- database - schema - table - columns

Monday April 23 12

spun up set of machines

Monday April 23 12

using fabric created thousands of schemas

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineB shard4 photos_by_user shard5 photos_by_user shard6 photos_by_user shard7 photos_by_user

Monday April 23 12

(fabric or similar parallel task executor is

essential)

Monday April 23 12

application-side logic

Monday April 23 12

SHARD_TO_DB =

SHARD_TO_DB[0] = 0SHARD_TO_DB[1] = 0SHARD_TO_DB[2] = 0SHARD_TO_DB[3] = 0SHARD_TO_DB[4] = 1SHARD_TO_DB[5] = 1SHARD_TO_DB[6] = 1SHARD_TO_DB[7] = 1

Monday April 23 12

instead of Django ORM wrote really simple db

abstraction layer

Monday April 23 12

selectupdateinsertdelete

Monday April 23 12

select(fields table_name shard_key where_statement where_parameters)

Monday April 23 12

select(fields table_name shard_key where_statement where_parameters)

shard_key num_logical_shards = shard_id

Monday April 23 12

in most cases user_id for us

Monday April 23 12

custom Django test runner to createtear-down sharded DBs

Monday April 23 12

most queries involve visiting handful of shards

over one or two machines

Monday April 23 12

if mapping across shards on single DB UNION

ALL to aggregate

Monday April 23 12

clients to library pass in((shard_key id)

(shard_keyid)) etc

Monday April 23 12

library maps sub-selects to each shard and each

machine

Monday April 23 12

parallel execution (per-machine at least)

Monday April 23 12

-gt Append (cost=00097372 rows=100 width=12) (actual time=0290160035 rows=30 loops=1) -gt Limit (cost=00080624 rows=30 width=12) (actual time=0288159913 rows=14 loops=1) -gt Index Scan Backward using index on table (cost=0001865104 rows=694 width=12) (actual time=0286159885 rows=14 loops=1) -gt Limit (cost=0007115 rows=30 width=12) (actual time=00150018 rows=1 loops=1) -gt Index Scan using index on table (cost=00010199 rows=43 width=12) (actual time=00130014 rows=1 loops=1) (etc)

Monday April 23 12

eventually would be nice to parallelize across

machines

Monday April 23 12

next challenge unique IDs

Monday April 23 12

requirements

Monday April 23 12

1 should be time sortable without requiring

a lookup

Monday April 23 12

2 should be 64-bit

Monday April 23 12

3 low operational complexity

Monday April 23 12

surveyed the options

Monday April 23 12

ticket servers

Monday April 23 12

UUID

Monday April 23 12

twitter snowflake

Monday April 23 12

application-level IDs ala Mongo

Monday April 23 12

hey the db is already pretty good about

incrementing sequences

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

CREATE OR REPLACE FUNCTION insta5next_id(OUT result bigint) AS $$DECLARE our_epoch bigint = 1314220021721 seq_id bigint now_millis bigint shard_id int = 5BEGIN SELECT nextval(insta5table_id_seq) 1024 INTO seq_id

SELECT FLOOR(EXTRACT(EPOCH FROM clock_timestamp()) 1000) INTO now_millis result = (now_millis - our_epoch) ltlt 23 result = result | (shard_id ltlt 10) result = result | (seq_id)END$$ LANGUAGE PLPGSQL

Monday April 23 12

pulling shard ID from ID

shard_id = id ^ ((id gtgt 23) ltlt 23)timestamp = EPOCH + id gtgt 23

Monday April 23 12

pros guaranteed unique in 64-bits not much of a

CPU overhead

Monday April 23 12

cons large IDs from the get-go

Monday April 23 12

hundreds of millions of IDs generated with this

scheme no issues

Monday April 23 12

well what about ldquore-shardingrdquo

Monday April 23 12

first recourse pg_reorg

Monday April 23 12

rewrites tables in index order

Monday April 23 12

only requires brief locks for atomic table renames

Monday April 23 12

20+GB savings on some of our dbs

Monday April 23 12

Monday April 23 12

especially useful on EC2

Monday April 23 12

but sometimes you just have to reshard

Monday April 23 12

streaming replication to the rescue

Monday April 23 12

(btw repmgr is awesome)

Monday April 23 12

repmgr standby clone ltmastergt

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineArsquo shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineC shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

Monday April 23 12

PGBouncer abstracts moving DBs from the

app logic

Monday April 23 12

can do this as long as you have more logical shards than physical

ones

Monday April 23 12

beauty of schemas is that they are physically

different files

Monday April 23 12

(no IO hit when deleting no lsquoswiss cheesersquo)

Monday April 23 12

downside requires ~30 seconds of maintenance to roll out new schema

mapping

Monday April 23 12

(could be solved by having concept of ldquoread-

onlyrdquo mode for some DBs)

Monday April 23 12

not great for range-scans that would span across

shards

Monday April 23 12

latest project follow graph

Monday April 23 12

v1 simple DB table(source_id target_id

status)

Monday April 23 12

who do I followwho follows me

do I follow Xdoes X follow me

Monday April 23 12

DB was busy so we started storing parallel

version in Redis

Monday April 23 12

follow_all(300 item list)

Monday April 23 12

inconsistency

Monday April 23 12

extra logic

Monday April 23 12

so much extra logic

Monday April 23 12

exposing your support team to the idea of cache invalidation

Monday April 23 12

Monday April 23 12

redesign took a page from twitterrsquos book

Monday April 23 12

PG can handle tens of thousands of requests very light memcached

caching

Monday April 23 12

next steps

Monday April 23 12

isolating services to minimize open conns

Monday April 23 12

investigate physical hardware etc to reduce

need to re-shard

Monday April 23 12

Wrap up

Monday April 23 12

you donrsquot need to give up PGrsquos durability amp features to shard

Monday April 23 12

continue to let the DB do what the DB is great at

Monday April 23 12

ldquodonrsquot shard until you have tordquo

Monday April 23 12

(but donrsquot over-estimate how hard it will be either)

Monday April 23 12

scaled within constraints of the cloud

Monday April 23 12

PG success story

Monday April 23 12

(wersquore really excited about 92)

Monday April 23 12

thanks any qs

Monday April 23 12

friends = Relationshipsobjectsfilter(source_user = user)

recent_photos = Mediaobjectsfilter(user_id__in = friends)order_by(lsquo-pkrsquo)[020]

Monday April 23 12

main feed at launch

Monday April 23 12

Redis user 33 postsfriends = SMEMBERS followers33for user in friends LPUSH feedltuser_idgt ltmedia_idgt

for readingLRANGE feed4 0 20

Monday April 23 12

canonical data PGfeedslistssets Redis

object cache memcache

Monday April 23 12

post-launch

Monday April 23 12

moved db to its own machine

Monday April 23 12

at time largest table photo metadata

Monday April 23 12

ran master-slave from the beginning with

streaming replication

Monday April 23 12

backups stop the replica xfs_freeze drives and take EBS snapshot

Monday April 23 12

AWS tradeoff

Monday April 23 12

3 early problems we hit with PG

Monday April 23 12

1 oh that setting was the problem

Monday April 23 12

work_mem

Monday April 23 12

shared_buffers

Monday April 23 12

cost_delay

Monday April 23 12

2 Django-specific ltidle in transactiongt

Monday April 23 12

3 connection pooling

Monday April 23 12

(we use PGBouncer)

Monday April 23 12

somewhere in this crazy couple of months

Christophe to the rescue

Monday April 23 12

photos kept growing and growing

Monday April 23 12

and only 68GB of RAM on biggest machine in EC2

Monday April 23 12

so what now

Monday April 23 12

Phase 2 Vertical Partitioning

Monday April 23 12

django db routers make it pretty easy

Monday April 23 12

def db_for_read(self model) if app_label == photos return photodb

Monday April 23 12

once you untangle all your foreign key

relationships

Monday April 23 12

(all of those useruser_id interchangeable calls bite

you now)

Monday April 23 12

plenty of time spent in PGFouine

Monday April 23 12

read slaves (using streaming replicas) where

we need to reduce contention

Monday April 23 12

a few months later

Monday April 23 12

photosdb gt 60GB

Monday April 23 12

precipitated by being on cloud hardware but likely to have hit limit eventually

either way

Monday April 23 12

what now

Monday April 23 12

horizontal partitioning

Monday April 23 12

Phase 3 sharding

Monday April 23 12

ldquosurely wersquoll have hired someone experienced before we actually need

to shardrdquo

Monday April 23 12

never true about scaling

Monday April 23 12

1 choosing a method2 adapting the application

3 expanding capacity

Monday April 23 12

evaluated solutions

Monday April 23 12

at the time none were up to task of being our

primary DB

Monday April 23 12

NoSQL alternatives

Monday April 23 12

Skypersquos sharding proxy

Monday April 23 12

Rangedate-based partitioning

Monday April 23 12

did in Postgres itself

Monday April 23 12

requirements

Monday April 23 12

1 low operational amp code complexity

Monday April 23 12

2 easy expanding of capacity

Monday April 23 12

3 low performance impact on application

Monday April 23 12

schema-based logical sharding

Monday April 23 12

many many many (thousands) of logical

shards

Monday April 23 12

that map to fewer physical ones

Monday April 23 12

8 logical shards on 2 machines

user_id 8 = logical shard

logical shards -gt physical shard map

0 A 1 A 2 A 3 A 4 B 5 B 6 B 7 B

Monday April 23 12

8 logical shards on 2 4 machines

user_id 8 = logical shard

logical shards -gt physical shard map

0 A 1 A 2 C 3 C 4 B 5 B 6 D 7 D

Monday April 23 12

iexclschemas

Monday April 23 12

all that lsquopublicrsquo stuff Irsquod been glossing over for 2

years

Monday April 23 12

- database - schema - table - columns

Monday April 23 12

spun up set of machines

Monday April 23 12

using fabric created thousands of schemas

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineB shard4 photos_by_user shard5 photos_by_user shard6 photos_by_user shard7 photos_by_user

Monday April 23 12

(fabric or similar parallel task executor is

essential)

Monday April 23 12

application-side logic

Monday April 23 12

SHARD_TO_DB =

SHARD_TO_DB[0] = 0SHARD_TO_DB[1] = 0SHARD_TO_DB[2] = 0SHARD_TO_DB[3] = 0SHARD_TO_DB[4] = 1SHARD_TO_DB[5] = 1SHARD_TO_DB[6] = 1SHARD_TO_DB[7] = 1

Monday April 23 12

instead of Django ORM wrote really simple db

abstraction layer

Monday April 23 12

selectupdateinsertdelete

Monday April 23 12

select(fields table_name shard_key where_statement where_parameters)

Monday April 23 12

select(fields table_name shard_key where_statement where_parameters)

shard_key num_logical_shards = shard_id

Monday April 23 12

in most cases user_id for us

Monday April 23 12

custom Django test runner to createtear-down sharded DBs

Monday April 23 12

most queries involve visiting handful of shards

over one or two machines

Monday April 23 12

if mapping across shards on single DB UNION

ALL to aggregate

Monday April 23 12

clients to library pass in((shard_key id)

(shard_keyid)) etc

Monday April 23 12

library maps sub-selects to each shard and each

machine

Monday April 23 12

parallel execution (per-machine at least)

Monday April 23 12

-gt Append (cost=00097372 rows=100 width=12) (actual time=0290160035 rows=30 loops=1) -gt Limit (cost=00080624 rows=30 width=12) (actual time=0288159913 rows=14 loops=1) -gt Index Scan Backward using index on table (cost=0001865104 rows=694 width=12) (actual time=0286159885 rows=14 loops=1) -gt Limit (cost=0007115 rows=30 width=12) (actual time=00150018 rows=1 loops=1) -gt Index Scan using index on table (cost=00010199 rows=43 width=12) (actual time=00130014 rows=1 loops=1) (etc)

Monday April 23 12

eventually would be nice to parallelize across

machines

Monday April 23 12

next challenge unique IDs

Monday April 23 12

requirements

Monday April 23 12

1 should be time sortable without requiring

a lookup

Monday April 23 12

2 should be 64-bit

Monday April 23 12

3 low operational complexity

Monday April 23 12

surveyed the options

Monday April 23 12

ticket servers

Monday April 23 12

UUID

Monday April 23 12

twitter snowflake

Monday April 23 12

application-level IDs ala Mongo

Monday April 23 12

hey the db is already pretty good about

incrementing sequences

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

CREATE OR REPLACE FUNCTION insta5.next_id(OUT result bigint) AS $$
DECLARE
    our_epoch bigint := 1314220021721;
    seq_id bigint;
    now_millis bigint;
    shard_id int := 5;
BEGIN
    SELECT nextval('insta5.table_id_seq') % 1024 INTO seq_id;
    SELECT FLOOR(EXTRACT(EPOCH FROM clock_timestamp()) * 1000) INTO now_millis;
    result := (now_millis - our_epoch) << 23;
    result := result | (shard_id << 10);
    result := result | (seq_id);
END;
$$ LANGUAGE PLPGSQL;
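
How the function gets wired in, as a sketch: each shard schema gets its own sequence, and the table's id column defaults to next_id(), so inserts mint unique, time-sortable 64-bit IDs inside the shard itself. The photos_by_user name and column list are assumptions; insta5 is simply the schema for logical shard 5, matching the hard-coded shard_id above.

import psycopg2

conn = psycopg2.connect(host="machineB", dbname="shards")        # assumed DSN
cur = conn.cursor()
# the sequence feeds the low 10 bits; next_id() (defined above) reads it
cur.execute("CREATE SEQUENCE insta5.table_id_seq")
cur.execute("""
    CREATE TABLE insta5.photos_by_user (
        id       bigint NOT NULL DEFAULT insta5.next_id(),  -- IDs minted inside the shard
        user_id  bigint NOT NULL,
        media_id bigint NOT NULL,
        PRIMARY KEY (id)
    )
""")
conn.commit()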

pulling shard ID from ID

shard_id = id ^ ((id >> 23) << 23)
timestamp = EPOCH + (id >> 23)
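
The same decoding in Python, following the 41/13/10 layout above: the XOR strips the timestamp bits and leaves shard ID plus sequence, and shifting those down 10 more bits isolates the shard ID. The function name is illustrative.

OUR_EPOCH_MS = 1314220021721                      # same custom epoch as next_id()

def decode_id(photo_id):
    timestamp_ms = OUR_EPOCH_MS + (photo_id >> 23)          # 41 bits of milliseconds
    low_23_bits = photo_id ^ ((photo_id >> 23) << 23)       # shard ID + sequence, as on the slide
    shard_id = low_23_bits >> 10                            # upper 13 of those 23 bits
    sequence = photo_id & ((1 << 10) - 1)                   # low 10 bits
    return timestamp_ms, shard_id, sequence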

pros: guaranteed unique in 64 bits, not much of a CPU overhead

cons: large IDs from the get-go

hundreds of millions of IDs generated with this scheme, no issues

well, what about "re-sharding"?

first recourse: pg_reorg

rewrites tables in index order

only requires brief locks for atomic table renames

20+GB savings on some of our dbs

especially useful on EC2

but sometimes you just have to reshard

streaming replication to the rescue

(btw repmgr is awesome)

repmgr standby clone <master>

machineA: shard0.photos_by_user, shard1.photos_by_user, shard2.photos_by_user, shard3.photos_by_user
machineA': shard0.photos_by_user, shard1.photos_by_user, shard2.photos_by_user, shard3.photos_by_user

machineA: shard0.photos_by_user, shard1.photos_by_user, shard2.photos_by_user, shard3.photos_by_user
machineC: shard0.photos_by_user, shard1.photos_by_user, shard2.photos_by_user, shard3.photos_by_user

PGBouncer abstracts moving DBs from the app logic

can do this as long as you have more logical shards than physical ones

beauty of schemas is that they are physically different files

(no I/O hit when deleting, no 'swiss cheese')

downside: requires ~30 seconds of maintenance to roll out new schema mapping

(could be solved by having concept of "read-only" mode for some DBs)

not great for range-scans that would span across shards

latest project: follow graph

v1: simple DB table (source_id, target_id, status)

who do I follow? who follows me?
do I follow X? does X follow me?
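
A minimal sketch of that v1 table and the four lookups it has to serve; the column types, index name, and the cur / user_id / x_id variables are assumptions:

# cur is a cursor on whichever shard holds this part of the graph
cur.execute("""
    CREATE TABLE follows (
        source_id bigint   NOT NULL,   -- the follower
        target_id bigint   NOT NULL,   -- the followee
        status    smallint NOT NULL,   -- e.g. pending / active / blocked
        PRIMARY KEY (source_id, target_id)
    )
""")
# second index so "who follows me" is as cheap as "who do I follow"
cur.execute("CREATE INDEX follows_target_idx ON follows (target_id)")

# who do I follow? / who follows me?
cur.execute("SELECT target_id FROM follows WHERE source_id = %s", (user_id,))
cur.execute("SELECT source_id FROM follows WHERE target_id = %s", (user_id,))
# do I follow X? / does X follow me?
cur.execute("SELECT 1 FROM follows WHERE source_id = %s AND target_id = %s", (user_id, x_id))
cur.execute("SELECT 1 FROM follows WHERE source_id = %s AND target_id = %s", (x_id, user_id))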

DB was busy, so we started storing a parallel version in Redis

follow_all(300-item list)

inconsistency

extra logic

so much extra logic

exposing your support team to the idea of cache invalidation

redesign took a page from twitter's book

PG can handle tens of thousands of requests, very light memcached caching

next steps

isolating services to minimize open conns

investigate physical hardware etc to reduce need to re-shard

Wrap up

you don't need to give up PG's durability & features to shard

continue to let the DB do what the DB is great at

"don't shard until you have to"

(but don't over-estimate how hard it will be either)

scaled within constraints of the cloud

PG success story

(we're really excited about 9.2)

thanks! any qs?


Monday April 23 12

pros guaranteed unique in 64-bits not much of a

CPU overhead

Monday April 23 12

cons large IDs from the get-go

Monday April 23 12

hundreds of millions of IDs generated with this

scheme no issues

Monday April 23 12

well what about ldquore-shardingrdquo

Monday April 23 12

first recourse pg_reorg

Monday April 23 12

rewrites tables in index order

Monday April 23 12

only requires brief locks for atomic table renames

Monday April 23 12

20+GB savings on some of our dbs

Monday April 23 12

Monday April 23 12

especially useful on EC2

Monday April 23 12

but sometimes you just have to reshard

Monday April 23 12

streaming replication to the rescue

Monday April 23 12

(btw repmgr is awesome)

Monday April 23 12

repmgr standby clone ltmastergt

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineArsquo shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineC shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

Monday April 23 12

PGBouncer abstracts moving DBs from the

app logic

Monday April 23 12

can do this as long as you have more logical shards than physical

ones

Monday April 23 12

beauty of schemas is that they are physically

different files

Monday April 23 12

(no IO hit when deleting no lsquoswiss cheesersquo)

Monday April 23 12

downside requires ~30 seconds of maintenance to roll out new schema

mapping

Monday April 23 12

(could be solved by having concept of ldquoread-

onlyrdquo mode for some DBs)

Monday April 23 12

not great for range-scans that would span across

shards

Monday April 23 12

latest project follow graph

Monday April 23 12

v1 simple DB table(source_id target_id

status)

Monday April 23 12

who do I followwho follows me

do I follow Xdoes X follow me

Monday April 23 12

DB was busy so we started storing parallel

version in Redis

Monday April 23 12

follow_all(300 item list)

Monday April 23 12

inconsistency

Monday April 23 12

extra logic

Monday April 23 12

so much extra logic

Monday April 23 12

exposing your support team to the idea of cache invalidation

Monday April 23 12

Monday April 23 12

redesign took a page from twitterrsquos book

Monday April 23 12

PG can handle tens of thousands of requests very light memcached

caching

Monday April 23 12

next steps

Monday April 23 12

isolating services to minimize open conns

Monday April 23 12

investigate physical hardware etc to reduce

need to re-shard

Monday April 23 12

Wrap up

Monday April 23 12

you donrsquot need to give up PGrsquos durability amp features to shard

Monday April 23 12

continue to let the DB do what the DB is great at

Monday April 23 12

ldquodonrsquot shard until you have tordquo

Monday April 23 12

(but donrsquot over-estimate how hard it will be either)

Monday April 23 12

scaled within constraints of the cloud

Monday April 23 12

PG success story

Monday April 23 12

(wersquore really excited about 92)

Monday April 23 12

thanks any qs

Monday April 23 12

ran master-slave from the beginning with

streaming replication

Monday April 23 12

backups stop the replica xfs_freeze drives and take EBS snapshot

Monday April 23 12

AWS tradeoff

Monday April 23 12

3 early problems we hit with PG

Monday April 23 12

1 oh that setting was the problem

Monday April 23 12

work_mem

Monday April 23 12

shared_buffers

Monday April 23 12

cost_delay

Monday April 23 12

2 Django-specific ltidle in transactiongt

Monday April 23 12

3 connection pooling

Monday April 23 12

(we use PGBouncer)

Monday April 23 12

somewhere in this crazy couple of months

Christophe to the rescue

Monday April 23 12

photos kept growing and growing

Monday April 23 12

and only 68GB of RAM on biggest machine in EC2

Monday April 23 12

so what now

Monday April 23 12

Phase 2 Vertical Partitioning

Monday April 23 12

django db routers make it pretty easy

Monday April 23 12

def db_for_read(self model) if app_label == photos return photodb

Monday April 23 12

once you untangle all your foreign key

relationships

Monday April 23 12

(all of those useruser_id interchangeable calls bite

you now)

Monday April 23 12

plenty of time spent in PGFouine

Monday April 23 12

read slaves (using streaming replicas) where

we need to reduce contention

Monday April 23 12

a few months later

Monday April 23 12

photosdb gt 60GB

Monday April 23 12

precipitated by being on cloud hardware but likely to have hit limit eventually

either way

Monday April 23 12

what now

Monday April 23 12

horizontal partitioning

Monday April 23 12

Phase 3 sharding

Monday April 23 12

ldquosurely wersquoll have hired someone experienced before we actually need

to shardrdquo

Monday April 23 12

never true about scaling

Monday April 23 12

1 choosing a method2 adapting the application

3 expanding capacity

Monday April 23 12

evaluated solutions

Monday April 23 12

at the time none were up to task of being our

primary DB

Monday April 23 12

NoSQL alternatives

Monday April 23 12

Skypersquos sharding proxy

Monday April 23 12

Rangedate-based partitioning

Monday April 23 12

did in Postgres itself

Monday April 23 12

requirements

Monday April 23 12

1 low operational amp code complexity

Monday April 23 12

2 easy expanding of capacity

Monday April 23 12

3 low performance impact on application

Monday April 23 12

schema-based logical sharding

Monday April 23 12

many many many (thousands) of logical

shards

Monday April 23 12

that map to fewer physical ones

Monday April 23 12

8 logical shards on 2 machines

user_id 8 = logical shard

logical shards -gt physical shard map

0 A 1 A 2 A 3 A 4 B 5 B 6 B 7 B

Monday April 23 12

8 logical shards on 2 4 machines

user_id 8 = logical shard

logical shards -gt physical shard map

0 A 1 A 2 C 3 C 4 B 5 B 6 D 7 D

Monday April 23 12

iexclschemas

Monday April 23 12

all that lsquopublicrsquo stuff Irsquod been glossing over for 2

years

Monday April 23 12

- database - schema - table - columns

Monday April 23 12

spun up set of machines

Monday April 23 12

using fabric created thousands of schemas

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineB shard4 photos_by_user shard5 photos_by_user shard6 photos_by_user shard7 photos_by_user

Monday April 23 12

(fabric or similar parallel task executor is

essential)

Monday April 23 12

application-side logic

Monday April 23 12

SHARD_TO_DB =

SHARD_TO_DB[0] = 0SHARD_TO_DB[1] = 0SHARD_TO_DB[2] = 0SHARD_TO_DB[3] = 0SHARD_TO_DB[4] = 1SHARD_TO_DB[5] = 1SHARD_TO_DB[6] = 1SHARD_TO_DB[7] = 1

Monday April 23 12

instead of Django ORM wrote really simple db

abstraction layer

Monday April 23 12

selectupdateinsertdelete

Monday April 23 12

select(fields table_name shard_key where_statement where_parameters)

Monday April 23 12

select(fields table_name shard_key where_statement where_parameters)

shard_key num_logical_shards = shard_id

Monday April 23 12

in most cases user_id for us

Monday April 23 12

custom Django test runner to createtear-down sharded DBs

Monday April 23 12

most queries involve visiting handful of shards

over one or two machines

Monday April 23 12

if mapping across shards on single DB UNION

ALL to aggregate

Monday April 23 12

clients to library pass in((shard_key id)

(shard_keyid)) etc

Monday April 23 12

library maps sub-selects to each shard and each

machine

Monday April 23 12

parallel execution (per-machine at least)

Monday April 23 12

-gt Append (cost=00097372 rows=100 width=12) (actual time=0290160035 rows=30 loops=1) -gt Limit (cost=00080624 rows=30 width=12) (actual time=0288159913 rows=14 loops=1) -gt Index Scan Backward using index on table (cost=0001865104 rows=694 width=12) (actual time=0286159885 rows=14 loops=1) -gt Limit (cost=0007115 rows=30 width=12) (actual time=00150018 rows=1 loops=1) -gt Index Scan using index on table (cost=00010199 rows=43 width=12) (actual time=00130014 rows=1 loops=1) (etc)

Monday April 23 12

eventually would be nice to parallelize across

machines

Monday April 23 12

next challenge unique IDs

Monday April 23 12

requirements

Monday April 23 12

1 should be time sortable without requiring

a lookup

Monday April 23 12

2 should be 64-bit

Monday April 23 12

3 low operational complexity

Monday April 23 12

surveyed the options

Monday April 23 12

ticket servers

Monday April 23 12

UUID

Monday April 23 12

twitter snowflake

Monday April 23 12

application-level IDs ala Mongo

Monday April 23 12

hey the db is already pretty good about

incrementing sequences

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

CREATE OR REPLACE FUNCTION insta5next_id(OUT result bigint) AS $$DECLARE our_epoch bigint = 1314220021721 seq_id bigint now_millis bigint shard_id int = 5BEGIN SELECT nextval(insta5table_id_seq) 1024 INTO seq_id

SELECT FLOOR(EXTRACT(EPOCH FROM clock_timestamp()) 1000) INTO now_millis result = (now_millis - our_epoch) ltlt 23 result = result | (shard_id ltlt 10) result = result | (seq_id)END$$ LANGUAGE PLPGSQL

Monday April 23 12

pulling shard ID from ID

shard_id = id ^ ((id gtgt 23) ltlt 23)timestamp = EPOCH + id gtgt 23

Monday April 23 12

pros guaranteed unique in 64-bits not much of a

CPU overhead

Monday April 23 12

cons large IDs from the get-go

Monday April 23 12

hundreds of millions of IDs generated with this

scheme no issues

Monday April 23 12

well what about ldquore-shardingrdquo

Monday April 23 12

first recourse pg_reorg

Monday April 23 12

rewrites tables in index order

Monday April 23 12

only requires brief locks for atomic table renames

Monday April 23 12

20+GB savings on some of our dbs

Monday April 23 12

Monday April 23 12

especially useful on EC2

Monday April 23 12

but sometimes you just have to reshard

Monday April 23 12

streaming replication to the rescue

Monday April 23 12

(btw repmgr is awesome)

Monday April 23 12

repmgr standby clone ltmastergt

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineArsquo shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineC shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

Monday April 23 12

PGBouncer abstracts moving DBs from the

app logic

Monday April 23 12

can do this as long as you have more logical shards than physical

ones

Monday April 23 12

beauty of schemas is that they are physically

different files

Monday April 23 12

(no IO hit when deleting no lsquoswiss cheesersquo)

Monday April 23 12

downside requires ~30 seconds of maintenance to roll out new schema

mapping

Monday April 23 12

(could be solved by having concept of ldquoread-

onlyrdquo mode for some DBs)

Monday April 23 12

not great for range-scans that would span across

shards

Monday April 23 12

latest project follow graph

Monday April 23 12

v1 simple DB table(source_id target_id

status)

Monday April 23 12

who do I followwho follows me

do I follow Xdoes X follow me

Monday April 23 12

DB was busy so we started storing parallel

version in Redis

Monday April 23 12

follow_all(300 item list)

Monday April 23 12

inconsistency

Monday April 23 12

extra logic

Monday April 23 12

so much extra logic

Monday April 23 12

exposing your support team to the idea of cache invalidation

Monday April 23 12

Monday April 23 12

redesign took a page from twitterrsquos book

Monday April 23 12

PG can handle tens of thousands of requests very light memcached

caching

Monday April 23 12

next steps

Monday April 23 12

isolating services to minimize open conns

Monday April 23 12

investigate physical hardware etc to reduce

need to re-shard

Monday April 23 12

Wrap up

Monday April 23 12

you donrsquot need to give up PGrsquos durability amp features to shard

Monday April 23 12

continue to let the DB do what the DB is great at

Monday April 23 12

ldquodonrsquot shard until you have tordquo

Monday April 23 12

(but donrsquot over-estimate how hard it will be either)

Monday April 23 12

scaled within constraints of the cloud

Monday April 23 12

PG success story

Monday April 23 12

(wersquore really excited about 92)

Monday April 23 12

thanks any qs

Monday April 23 12

backups stop the replica xfs_freeze drives and take EBS snapshot

Monday April 23 12

AWS tradeoff

Monday April 23 12

3 early problems we hit with PG

Monday April 23 12

1 oh that setting was the problem

Monday April 23 12

work_mem

Monday April 23 12

shared_buffers

Monday April 23 12

cost_delay

Monday April 23 12

2 Django-specific ltidle in transactiongt

Monday April 23 12

3 connection pooling

Monday April 23 12

(we use PGBouncer)

Monday April 23 12

somewhere in this crazy couple of months

Christophe to the rescue

Monday April 23 12

photos kept growing and growing

Monday April 23 12

and only 68GB of RAM on biggest machine in EC2

Monday April 23 12

so what now

Monday April 23 12

Phase 2 Vertical Partitioning

Monday April 23 12

django db routers make it pretty easy

Monday April 23 12

def db_for_read(self model) if app_label == photos return photodb

Monday April 23 12

once you untangle all your foreign key

relationships

Monday April 23 12

(all of those useruser_id interchangeable calls bite

you now)

Monday April 23 12

plenty of time spent in PGFouine

Monday April 23 12

read slaves (using streaming replicas) where

we need to reduce contention

Monday April 23 12

a few months later

Monday April 23 12

photosdb gt 60GB

Monday April 23 12

precipitated by being on cloud hardware but likely to have hit limit eventually

either way

Monday April 23 12

what now

Monday April 23 12

horizontal partitioning

Monday April 23 12

Phase 3 sharding

Monday April 23 12

ldquosurely wersquoll have hired someone experienced before we actually need

to shardrdquo

Monday April 23 12

never true about scaling

Monday April 23 12

1 choosing a method2 adapting the application

3 expanding capacity

Monday April 23 12

evaluated solutions

Monday April 23 12

at the time none were up to task of being our

primary DB

Monday April 23 12

NoSQL alternatives

Monday April 23 12

Skypersquos sharding proxy

Monday April 23 12

Rangedate-based partitioning

Monday April 23 12

did in Postgres itself

Monday April 23 12

requirements

Monday April 23 12

1 low operational amp code complexity

Monday April 23 12

2 easy expanding of capacity

Monday April 23 12

3 low performance impact on application

Monday April 23 12

schema-based logical sharding

Monday April 23 12

many many many (thousands) of logical

shards

Monday April 23 12

that map to fewer physical ones

Monday April 23 12

8 logical shards on 2 machines

user_id 8 = logical shard

logical shards -gt physical shard map

0 A 1 A 2 A 3 A 4 B 5 B 6 B 7 B

Monday April 23 12

8 logical shards on 2 4 machines

user_id 8 = logical shard

logical shards -gt physical shard map

0 A 1 A 2 C 3 C 4 B 5 B 6 D 7 D

Monday April 23 12

iexclschemas

Monday April 23 12

all that lsquopublicrsquo stuff Irsquod been glossing over for 2

years

Monday April 23 12

- database - schema - table - columns

Monday April 23 12

spun up set of machines

Monday April 23 12

using fabric created thousands of schemas

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineB shard4 photos_by_user shard5 photos_by_user shard6 photos_by_user shard7 photos_by_user

Monday April 23 12

(fabric or similar parallel task executor is

essential)

Monday April 23 12

application-side logic

Monday April 23 12

SHARD_TO_DB =

SHARD_TO_DB[0] = 0SHARD_TO_DB[1] = 0SHARD_TO_DB[2] = 0SHARD_TO_DB[3] = 0SHARD_TO_DB[4] = 1SHARD_TO_DB[5] = 1SHARD_TO_DB[6] = 1SHARD_TO_DB[7] = 1

Monday April 23 12

instead of Django ORM wrote really simple db

abstraction layer

Monday April 23 12

selectupdateinsertdelete

Monday April 23 12

select(fields table_name shard_key where_statement where_parameters)

Monday April 23 12

select(fields table_name shard_key where_statement where_parameters)

shard_key num_logical_shards = shard_id

Monday April 23 12

in most cases user_id for us

Monday April 23 12

custom Django test runner to createtear-down sharded DBs

Monday April 23 12

most queries involve visiting handful of shards

over one or two machines

Monday April 23 12

if mapping across shards on single DB UNION

ALL to aggregate

Monday April 23 12

clients to library pass in((shard_key id)

(shard_keyid)) etc

Monday April 23 12

library maps sub-selects to each shard and each

machine

Monday April 23 12

parallel execution (per-machine at least)

Monday April 23 12

-gt Append (cost=00097372 rows=100 width=12) (actual time=0290160035 rows=30 loops=1) -gt Limit (cost=00080624 rows=30 width=12) (actual time=0288159913 rows=14 loops=1) -gt Index Scan Backward using index on table (cost=0001865104 rows=694 width=12) (actual time=0286159885 rows=14 loops=1) -gt Limit (cost=0007115 rows=30 width=12) (actual time=00150018 rows=1 loops=1) -gt Index Scan using index on table (cost=00010199 rows=43 width=12) (actual time=00130014 rows=1 loops=1) (etc)

Monday April 23 12

eventually would be nice to parallelize across

machines

Monday April 23 12

next challenge unique IDs

Monday April 23 12

requirements

Monday April 23 12

1 should be time sortable without requiring

a lookup

Monday April 23 12

2 should be 64-bit

Monday April 23 12

3 low operational complexity

Monday April 23 12

surveyed the options

Monday April 23 12

ticket servers

Monday April 23 12

UUID

Monday April 23 12

twitter snowflake

Monday April 23 12

application-level IDs ala Mongo

Monday April 23 12

hey the db is already pretty good about

incrementing sequences

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

CREATE OR REPLACE FUNCTION insta5next_id(OUT result bigint) AS $$DECLARE our_epoch bigint = 1314220021721 seq_id bigint now_millis bigint shard_id int = 5BEGIN SELECT nextval(insta5table_id_seq) 1024 INTO seq_id

SELECT FLOOR(EXTRACT(EPOCH FROM clock_timestamp()) 1000) INTO now_millis result = (now_millis - our_epoch) ltlt 23 result = result | (shard_id ltlt 10) result = result | (seq_id)END$$ LANGUAGE PLPGSQL

Monday April 23 12

pulling shard ID from ID

shard_id = id ^ ((id gtgt 23) ltlt 23)timestamp = EPOCH + id gtgt 23

Monday April 23 12

pros guaranteed unique in 64-bits not much of a

CPU overhead

Monday April 23 12

cons large IDs from the get-go

Monday April 23 12

hundreds of millions of IDs generated with this

scheme no issues

Monday April 23 12

well what about ldquore-shardingrdquo

Monday April 23 12

first recourse pg_reorg

Monday April 23 12

rewrites tables in index order

Monday April 23 12

only requires brief locks for atomic table renames

Monday April 23 12

20+GB savings on some of our dbs

Monday April 23 12

Monday April 23 12

especially useful on EC2

Monday April 23 12

but sometimes you just have to reshard

Monday April 23 12

streaming replication to the rescue

Monday April 23 12

(btw repmgr is awesome)

Monday April 23 12

repmgr standby clone ltmastergt

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineArsquo shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineC shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

Monday April 23 12

PGBouncer abstracts moving DBs from the

app logic

Monday April 23 12

can do this as long as you have more logical shards than physical

ones

Monday April 23 12

beauty of schemas is that they are physically

different files

Monday April 23 12

(no IO hit when deleting no lsquoswiss cheesersquo)

Monday April 23 12

downside requires ~30 seconds of maintenance to roll out new schema

mapping

Monday April 23 12

(could be solved by having concept of ldquoread-

onlyrdquo mode for some DBs)

Monday April 23 12

not great for range-scans that would span across

shards

Monday April 23 12

latest project follow graph

Monday April 23 12

v1 simple DB table(source_id target_id

status)

Monday April 23 12

who do I followwho follows me

do I follow Xdoes X follow me

Monday April 23 12

DB was busy so we started storing parallel

version in Redis

Monday April 23 12

follow_all(300 item list)

Monday April 23 12

inconsistency

Monday April 23 12

extra logic

Monday April 23 12

so much extra logic

Monday April 23 12

exposing your support team to the idea of cache invalidation

Monday April 23 12

Monday April 23 12

redesign took a page from twitterrsquos book

Monday April 23 12

PG can handle tens of thousands of requests very light memcached

caching

Monday April 23 12

next steps

Monday April 23 12

isolating services to minimize open conns

Monday April 23 12

investigate physical hardware etc to reduce

need to re-shard

Monday April 23 12

Wrap up

Monday April 23 12

you donrsquot need to give up PGrsquos durability amp features to shard

Monday April 23 12

continue to let the DB do what the DB is great at

Monday April 23 12

ldquodonrsquot shard until you have tordquo

Monday April 23 12

(but donrsquot over-estimate how hard it will be either)

Monday April 23 12

scaled within constraints of the cloud

Monday April 23 12

PG success story

Monday April 23 12

(wersquore really excited about 92)

Monday April 23 12

thanks any qs

Monday April 23 12

AWS tradeoff

Monday April 23 12

3 early problems we hit with PG

Monday April 23 12

1 oh that setting was the problem

Monday April 23 12

work_mem

Monday April 23 12

shared_buffers

Monday April 23 12

cost_delay

Monday April 23 12

2 Django-specific ltidle in transactiongt

Monday April 23 12

3 connection pooling

Monday April 23 12

(we use PGBouncer)

Monday April 23 12

somewhere in this crazy couple of months

Christophe to the rescue

Monday April 23 12

photos kept growing and growing

Monday April 23 12

and only 68GB of RAM on biggest machine in EC2

Monday April 23 12

so what now

Monday April 23 12

Phase 2 Vertical Partitioning

Monday April 23 12

django db routers make it pretty easy

Monday April 23 12

def db_for_read(self model) if app_label == photos return photodb

Monday April 23 12

once you untangle all your foreign key

relationships

Monday April 23 12

(all of those useruser_id interchangeable calls bite

you now)

Monday April 23 12

plenty of time spent in PGFouine

Monday April 23 12

read slaves (using streaming replicas) where

we need to reduce contention

Monday April 23 12

a few months later

Monday April 23 12

photosdb gt 60GB

Monday April 23 12

precipitated by being on cloud hardware but likely to have hit limit eventually

either way

Monday April 23 12

what now

Monday April 23 12

horizontal partitioning

Monday April 23 12

Phase 3 sharding

Monday April 23 12

ldquosurely wersquoll have hired someone experienced before we actually need

to shardrdquo

Monday April 23 12

never true about scaling

Monday April 23 12

1 choosing a method2 adapting the application

3 expanding capacity

Monday April 23 12

evaluated solutions

Monday April 23 12

at the time none were up to task of being our

primary DB

Monday April 23 12

NoSQL alternatives

Monday April 23 12

Skypersquos sharding proxy

Monday April 23 12

Rangedate-based partitioning

Monday April 23 12

did in Postgres itself

Monday April 23 12

requirements

Monday April 23 12

1 low operational amp code complexity

Monday April 23 12

2 easy expanding of capacity

Monday April 23 12

3 low performance impact on application

Monday April 23 12

schema-based logical sharding

Monday April 23 12

many many many (thousands) of logical

shards

Monday April 23 12

that map to fewer physical ones

Monday April 23 12

8 logical shards on 2 machines

user_id 8 = logical shard

logical shards -gt physical shard map

0 A 1 A 2 A 3 A 4 B 5 B 6 B 7 B

Monday April 23 12

8 logical shards on 2 4 machines

user_id 8 = logical shard

logical shards -gt physical shard map

0 A 1 A 2 C 3 C 4 B 5 B 6 D 7 D

Monday April 23 12

iexclschemas

Monday April 23 12

all that lsquopublicrsquo stuff Irsquod been glossing over for 2

years

Monday April 23 12

- database - schema - table - columns

Monday April 23 12

spun up set of machines

Monday April 23 12

using fabric created thousands of schemas

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineB shard4 photos_by_user shard5 photos_by_user shard6 photos_by_user shard7 photos_by_user

Monday April 23 12

(fabric or similar parallel task executor is

essential)

Monday April 23 12

application-side logic

Monday April 23 12

SHARD_TO_DB =

SHARD_TO_DB[0] = 0SHARD_TO_DB[1] = 0SHARD_TO_DB[2] = 0SHARD_TO_DB[3] = 0SHARD_TO_DB[4] = 1SHARD_TO_DB[5] = 1SHARD_TO_DB[6] = 1SHARD_TO_DB[7] = 1

Monday April 23 12

instead of Django ORM wrote really simple db

abstraction layer

Monday April 23 12

selectupdateinsertdelete

Monday April 23 12

select(fields table_name shard_key where_statement where_parameters)

Monday April 23 12

select(fields table_name shard_key where_statement where_parameters)

shard_key num_logical_shards = shard_id

Monday April 23 12

in most cases user_id for us

Monday April 23 12

custom Django test runner to createtear-down sharded DBs

Monday April 23 12

most queries involve visiting handful of shards

over one or two machines

Monday April 23 12

if mapping across shards on single DB UNION

ALL to aggregate

Monday April 23 12

clients to library pass in((shard_key id)

(shard_keyid)) etc

Monday April 23 12

library maps sub-selects to each shard and each

machine

Monday April 23 12

parallel execution (per-machine at least)

Monday April 23 12

-gt Append (cost=00097372 rows=100 width=12) (actual time=0290160035 rows=30 loops=1) -gt Limit (cost=00080624 rows=30 width=12) (actual time=0288159913 rows=14 loops=1) -gt Index Scan Backward using index on table (cost=0001865104 rows=694 width=12) (actual time=0286159885 rows=14 loops=1) -gt Limit (cost=0007115 rows=30 width=12) (actual time=00150018 rows=1 loops=1) -gt Index Scan using index on table (cost=00010199 rows=43 width=12) (actual time=00130014 rows=1 loops=1) (etc)

Monday April 23 12

eventually would be nice to parallelize across

machines

Monday April 23 12

next challenge unique IDs

Monday April 23 12

requirements

Monday April 23 12

1 should be time sortable without requiring

a lookup

Monday April 23 12

2 should be 64-bit

Monday April 23 12

3 low operational complexity

Monday April 23 12

surveyed the options

Monday April 23 12

ticket servers

Monday April 23 12

UUID

Monday April 23 12

twitter snowflake

Monday April 23 12

application-level IDs ala Mongo

Monday April 23 12

hey the db is already pretty good about

incrementing sequences

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

CREATE OR REPLACE FUNCTION insta5next_id(OUT result bigint) AS $$DECLARE our_epoch bigint = 1314220021721 seq_id bigint now_millis bigint shard_id int = 5BEGIN SELECT nextval(insta5table_id_seq) 1024 INTO seq_id

SELECT FLOOR(EXTRACT(EPOCH FROM clock_timestamp()) 1000) INTO now_millis result = (now_millis - our_epoch) ltlt 23 result = result | (shard_id ltlt 10) result = result | (seq_id)END$$ LANGUAGE PLPGSQL

Monday April 23 12

pulling shard ID from ID

shard_id = id ^ ((id gtgt 23) ltlt 23)timestamp = EPOCH + id gtgt 23

Monday April 23 12

pros guaranteed unique in 64-bits not much of a

CPU overhead

Monday April 23 12

cons large IDs from the get-go

Monday April 23 12

hundreds of millions of IDs generated with this

scheme no issues

Monday April 23 12

well what about ldquore-shardingrdquo

Monday April 23 12

first recourse pg_reorg

Monday April 23 12

rewrites tables in index order

Monday April 23 12

only requires brief locks for atomic table renames

Monday April 23 12

20+GB savings on some of our dbs

Monday April 23 12

Monday April 23 12

especially useful on EC2

Monday April 23 12

but sometimes you just have to reshard

Monday April 23 12

streaming replication to the rescue

Monday April 23 12

(btw repmgr is awesome)

Monday April 23 12

repmgr standby clone ltmastergt

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineArsquo shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineC shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

Monday April 23 12

PGBouncer abstracts moving DBs from the

app logic

Monday April 23 12

can do this as long as you have more logical shards than physical

ones

Monday April 23 12

beauty of schemas is that they are physically

different files

Monday April 23 12

(no IO hit when deleting no lsquoswiss cheesersquo)

Monday April 23 12

downside requires ~30 seconds of maintenance to roll out new schema

mapping

Monday April 23 12

(could be solved by having concept of ldquoread-

onlyrdquo mode for some DBs)

Monday April 23 12

not great for range-scans that would span across

shards

Monday April 23 12

latest project follow graph

Monday April 23 12

v1 simple DB table(source_id target_id

status)

Monday April 23 12

who do I followwho follows me

do I follow Xdoes X follow me

Monday April 23 12

DB was busy so we started storing parallel

version in Redis

Monday April 23 12

follow_all(300 item list)

Monday April 23 12

inconsistency

Monday April 23 12

extra logic

Monday April 23 12

so much extra logic

Monday April 23 12

exposing your support team to the idea of cache invalidation

Monday April 23 12

Monday April 23 12

redesign took a page from twitterrsquos book

Monday April 23 12

PG can handle tens of thousands of requests very light memcached

caching

Monday April 23 12

next steps

Monday April 23 12

isolating services to minimize open conns

Monday April 23 12

investigate physical hardware etc to reduce

need to re-shard

Monday April 23 12

Wrap up

Monday April 23 12

you donrsquot need to give up PGrsquos durability amp features to shard

Monday April 23 12

continue to let the DB do what the DB is great at

Monday April 23 12

ldquodonrsquot shard until you have tordquo

Monday April 23 12

(but donrsquot over-estimate how hard it will be either)

Monday April 23 12

scaled within constraints of the cloud

Monday April 23 12

PG success story

Monday April 23 12

(wersquore really excited about 92)

Monday April 23 12

thanks any qs

Monday April 23 12

3 early problems we hit with PG

Monday April 23 12

1 oh that setting was the problem

Monday April 23 12

work_mem

Monday April 23 12

shared_buffers

Monday April 23 12

cost_delay

Monday April 23 12

2 Django-specific ltidle in transactiongt

Monday April 23 12

3 connection pooling

Monday April 23 12

(we use PGBouncer)

Monday April 23 12

somewhere in this crazy couple of months

Christophe to the rescue

Monday April 23 12

photos kept growing and growing

Monday April 23 12

and only 68GB of RAM on biggest machine in EC2

Monday April 23 12

so what now

Monday April 23 12

Phase 2 Vertical Partitioning

Monday April 23 12

django db routers make it pretty easy

Monday April 23 12

def db_for_read(self model) if app_label == photos return photodb

Monday April 23 12

once you untangle all your foreign key

relationships

Monday April 23 12

(all of those useruser_id interchangeable calls bite

you now)

Monday April 23 12

plenty of time spent in PGFouine

Monday April 23 12

read slaves (using streaming replicas) where

we need to reduce contention

Monday April 23 12

a few months later

Monday April 23 12

photosdb gt 60GB

Monday April 23 12

precipitated by being on cloud hardware but likely to have hit limit eventually

either way

Monday April 23 12

what now

Monday April 23 12

horizontal partitioning

Monday April 23 12

Phase 3 sharding

Monday April 23 12

ldquosurely wersquoll have hired someone experienced before we actually need

to shardrdquo

Monday April 23 12

never true about scaling

Monday April 23 12

1 choosing a method2 adapting the application

3 expanding capacity

Monday April 23 12

evaluated solutions

Monday April 23 12

at the time none were up to task of being our

primary DB

Monday April 23 12

NoSQL alternatives

Monday April 23 12

Skypersquos sharding proxy

Monday April 23 12

Rangedate-based partitioning

Monday April 23 12

did in Postgres itself

Monday April 23 12

requirements

Monday April 23 12

1 low operational amp code complexity

Monday April 23 12

2 easy expanding of capacity

Monday April 23 12

3 low performance impact on application

Monday April 23 12

schema-based logical sharding

Monday April 23 12

many many many (thousands) of logical

shards

Monday April 23 12

that map to fewer physical ones

Monday April 23 12

8 logical shards on 2 machines

user_id 8 = logical shard

logical shards -gt physical shard map

0 A 1 A 2 A 3 A 4 B 5 B 6 B 7 B

Monday April 23 12

8 logical shards on 2 4 machines

user_id 8 = logical shard

logical shards -gt physical shard map

0 A 1 A 2 C 3 C 4 B 5 B 6 D 7 D

Monday April 23 12

iexclschemas

Monday April 23 12

all that lsquopublicrsquo stuff Irsquod been glossing over for 2

years

Monday April 23 12

- database - schema - table - columns

Monday April 23 12

spun up set of machines

Monday April 23 12

using fabric created thousands of schemas

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineB shard4 photos_by_user shard5 photos_by_user shard6 photos_by_user shard7 photos_by_user

Monday April 23 12

(fabric or similar parallel task executor is

essential)

Monday April 23 12

application-side logic

Monday April 23 12

SHARD_TO_DB =

SHARD_TO_DB[0] = 0SHARD_TO_DB[1] = 0SHARD_TO_DB[2] = 0SHARD_TO_DB[3] = 0SHARD_TO_DB[4] = 1SHARD_TO_DB[5] = 1SHARD_TO_DB[6] = 1SHARD_TO_DB[7] = 1

Monday April 23 12

instead of Django ORM wrote really simple db

abstraction layer

Monday April 23 12

selectupdateinsertdelete

Monday April 23 12

select(fields table_name shard_key where_statement where_parameters)

Monday April 23 12

select(fields table_name shard_key where_statement where_parameters)

shard_key num_logical_shards = shard_id

Monday April 23 12

in most cases user_id for us

Monday April 23 12

custom Django test runner to createtear-down sharded DBs

Monday April 23 12

most queries involve visiting handful of shards

over one or two machines

Monday April 23 12

if mapping across shards on single DB UNION

ALL to aggregate

Monday April 23 12

clients to library pass in((shard_key id)

(shard_keyid)) etc

Monday April 23 12

library maps sub-selects to each shard and each

machine

Monday April 23 12

parallel execution (per-machine at least)

Monday April 23 12

-gt Append (cost=00097372 rows=100 width=12) (actual time=0290160035 rows=30 loops=1) -gt Limit (cost=00080624 rows=30 width=12) (actual time=0288159913 rows=14 loops=1) -gt Index Scan Backward using index on table (cost=0001865104 rows=694 width=12) (actual time=0286159885 rows=14 loops=1) -gt Limit (cost=0007115 rows=30 width=12) (actual time=00150018 rows=1 loops=1) -gt Index Scan using index on table (cost=00010199 rows=43 width=12) (actual time=00130014 rows=1 loops=1) (etc)

Monday April 23 12

eventually would be nice to parallelize across

machines

Monday April 23 12

next challenge unique IDs

Monday April 23 12

requirements

Monday April 23 12

1 should be time sortable without requiring

a lookup

Monday April 23 12

2 should be 64-bit

Monday April 23 12

3 low operational complexity

Monday April 23 12

surveyed the options

Monday April 23 12

ticket servers

Monday April 23 12

UUID

Monday April 23 12

twitter snowflake

Monday April 23 12

application-level IDs ala Mongo

Monday April 23 12

hey the db is already pretty good about

incrementing sequences

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

CREATE OR REPLACE FUNCTION insta5next_id(OUT result bigint) AS $$DECLARE our_epoch bigint = 1314220021721 seq_id bigint now_millis bigint shard_id int = 5BEGIN SELECT nextval(insta5table_id_seq) 1024 INTO seq_id

SELECT FLOOR(EXTRACT(EPOCH FROM clock_timestamp()) 1000) INTO now_millis result = (now_millis - our_epoch) ltlt 23 result = result | (shard_id ltlt 10) result = result | (seq_id)END$$ LANGUAGE PLPGSQL

Monday April 23 12

pulling shard ID from ID

shard_id = id ^ ((id gtgt 23) ltlt 23)timestamp = EPOCH + id gtgt 23

Monday April 23 12

pros guaranteed unique in 64-bits not much of a

CPU overhead

Monday April 23 12

cons large IDs from the get-go

Monday April 23 12

hundreds of millions of IDs generated with this

scheme no issues

Monday April 23 12

well what about ldquore-shardingrdquo

Monday April 23 12

first recourse pg_reorg

Monday April 23 12

rewrites tables in index order

Monday April 23 12

only requires brief locks for atomic table renames

Monday April 23 12

20+GB savings on some of our dbs

Monday April 23 12

Monday April 23 12

especially useful on EC2

Monday April 23 12

but sometimes you just have to reshard

Monday April 23 12

streaming replication to the rescue

Monday April 23 12

(btw repmgr is awesome)

Monday April 23 12

repmgr standby clone ltmastergt

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineArsquo shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineC shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

Monday April 23 12

PGBouncer abstracts moving DBs from the

app logic

Monday April 23 12

can do this as long as you have more logical shards than physical

ones

Monday April 23 12

beauty of schemas is that they are physically

different files

Monday April 23 12

(no IO hit when deleting no lsquoswiss cheesersquo)

Monday April 23 12

downside requires ~30 seconds of maintenance to roll out new schema

mapping

Monday April 23 12

(could be solved by having concept of ldquoread-

onlyrdquo mode for some DBs)

Monday April 23 12

not great for range-scans that would span across

shards

Monday April 23 12

latest project follow graph

Monday April 23 12

v1 simple DB table(source_id target_id

status)

Monday April 23 12

who do I followwho follows me

do I follow Xdoes X follow me

Monday April 23 12

DB was busy so we started storing parallel

version in Redis

Monday April 23 12

follow_all(300 item list)

Monday April 23 12

inconsistency

Monday April 23 12

extra logic

Monday April 23 12

so much extra logic

Monday April 23 12

exposing your support team to the idea of cache invalidation

Monday April 23 12

Monday April 23 12

redesign took a page from twitterrsquos book

Monday April 23 12

PG can handle tens of thousands of requests very light memcached

caching

Monday April 23 12

next steps

Monday April 23 12

isolating services to minimize open conns

Monday April 23 12

investigate physical hardware etc to reduce

need to re-shard

Monday April 23 12

Wrap up

Monday April 23 12

you donrsquot need to give up PGrsquos durability amp features to shard

Monday April 23 12

continue to let the DB do what the DB is great at

Monday April 23 12

ldquodonrsquot shard until you have tordquo

Monday April 23 12

(but donrsquot over-estimate how hard it will be either)

Monday April 23 12

scaled within constraints of the cloud

Monday April 23 12

PG success story

Monday April 23 12

(wersquore really excited about 92)

Monday April 23 12

thanks any qs

Monday April 23 12

1 oh that setting was the problem

Monday April 23 12

work_mem

Monday April 23 12

shared_buffers

Monday April 23 12

cost_delay

Monday April 23 12

2 Django-specific ltidle in transactiongt

Monday April 23 12

3 connection pooling

Monday April 23 12

(we use PGBouncer)

Monday April 23 12

somewhere in this crazy couple of months

Christophe to the rescue

Monday April 23 12

photos kept growing and growing

Monday April 23 12

and only 68GB of RAM on biggest machine in EC2

Monday April 23 12

so what now

Monday April 23 12

Phase 2 Vertical Partitioning

Monday April 23 12

django db routers make it pretty easy

Monday April 23 12

def db_for_read(self model) if app_label == photos return photodb

Monday April 23 12

once you untangle all your foreign key

relationships

Monday April 23 12

(all of those useruser_id interchangeable calls bite

you now)

Monday April 23 12

plenty of time spent in PGFouine

Monday April 23 12

read slaves (using streaming replicas) where

we need to reduce contention

Monday April 23 12

a few months later

Monday April 23 12

photosdb gt 60GB

Monday April 23 12

precipitated by being on cloud hardware but likely to have hit limit eventually

either way

Monday April 23 12

what now

Monday April 23 12

horizontal partitioning

Monday April 23 12

Phase 3 sharding

Monday April 23 12

ldquosurely wersquoll have hired someone experienced before we actually need

to shardrdquo

Monday April 23 12

never true about scaling

Monday April 23 12

1 choosing a method2 adapting the application

3 expanding capacity

Monday April 23 12

evaluated solutions

Monday April 23 12

at the time none were up to task of being our

primary DB

Monday April 23 12

NoSQL alternatives

Monday April 23 12

Skypersquos sharding proxy

Monday April 23 12

Rangedate-based partitioning

Monday April 23 12

did in Postgres itself

Monday April 23 12

requirements

Monday April 23 12

1 low operational amp code complexity

Monday April 23 12

2 easy expanding of capacity

Monday April 23 12

3 low performance impact on application

Monday April 23 12

schema-based logical sharding

Monday April 23 12

many many many (thousands) of logical

shards

Monday April 23 12

that map to fewer physical ones

Monday April 23 12

8 logical shards on 2 machines

user_id 8 = logical shard

logical shards -gt physical shard map

0 A 1 A 2 A 3 A 4 B 5 B 6 B 7 B

Monday April 23 12

8 logical shards on 2 4 machines

user_id 8 = logical shard

logical shards -gt physical shard map

0 A 1 A 2 C 3 C 4 B 5 B 6 D 7 D

Monday April 23 12

iexclschemas

Monday April 23 12

all that lsquopublicrsquo stuff Irsquod been glossing over for 2

years

Monday April 23 12

- database - schema - table - columns

Monday April 23 12

spun up set of machines

Monday April 23 12

using fabric created thousands of schemas

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineB shard4 photos_by_user shard5 photos_by_user shard6 photos_by_user shard7 photos_by_user

Monday April 23 12

(fabric or similar parallel task executor is

essential)

Monday April 23 12

application-side logic

Monday April 23 12

SHARD_TO_DB =

SHARD_TO_DB[0] = 0SHARD_TO_DB[1] = 0SHARD_TO_DB[2] = 0SHARD_TO_DB[3] = 0SHARD_TO_DB[4] = 1SHARD_TO_DB[5] = 1SHARD_TO_DB[6] = 1SHARD_TO_DB[7] = 1

Monday April 23 12

instead of Django ORM wrote really simple db

abstraction layer

Monday April 23 12

selectupdateinsertdelete

Monday April 23 12

select(fields table_name shard_key where_statement where_parameters)

Monday April 23 12

select(fields table_name shard_key where_statement where_parameters)

shard_key num_logical_shards = shard_id

Monday April 23 12

in most cases user_id for us

Monday April 23 12

custom Django test runner to createtear-down sharded DBs

Monday April 23 12

most queries involve visiting handful of shards

over one or two machines

Monday April 23 12

if mapping across shards on single DB UNION

ALL to aggregate

Monday April 23 12

clients to library pass in((shard_key id)

(shard_keyid)) etc

Monday April 23 12

library maps sub-selects to each shard and each

machine

Monday April 23 12

parallel execution (per-machine at least)

Monday April 23 12

-gt Append (cost=00097372 rows=100 width=12) (actual time=0290160035 rows=30 loops=1) -gt Limit (cost=00080624 rows=30 width=12) (actual time=0288159913 rows=14 loops=1) -gt Index Scan Backward using index on table (cost=0001865104 rows=694 width=12) (actual time=0286159885 rows=14 loops=1) -gt Limit (cost=0007115 rows=30 width=12) (actual time=00150018 rows=1 loops=1) -gt Index Scan using index on table (cost=00010199 rows=43 width=12) (actual time=00130014 rows=1 loops=1) (etc)

Monday April 23 12

eventually would be nice to parallelize across

machines

Monday April 23 12

next challenge unique IDs

Monday April 23 12

requirements

Monday April 23 12

1 should be time sortable without requiring

a lookup

Monday April 23 12

2 should be 64-bit

Monday April 23 12

3 low operational complexity

Monday April 23 12

surveyed the options

Monday April 23 12

ticket servers

Monday April 23 12

UUID

Monday April 23 12

twitter snowflake

Monday April 23 12

application-level IDs ala Mongo

Monday April 23 12

hey the db is already pretty good about

incrementing sequences

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

CREATE OR REPLACE FUNCTION insta5next_id(OUT result bigint) AS $$DECLARE our_epoch bigint = 1314220021721 seq_id bigint now_millis bigint shard_id int = 5BEGIN SELECT nextval(insta5table_id_seq) 1024 INTO seq_id

SELECT FLOOR(EXTRACT(EPOCH FROM clock_timestamp()) 1000) INTO now_millis result = (now_millis - our_epoch) ltlt 23 result = result | (shard_id ltlt 10) result = result | (seq_id)END$$ LANGUAGE PLPGSQL

Monday April 23 12

pulling shard ID from ID

shard_id = id ^ ((id gtgt 23) ltlt 23)timestamp = EPOCH + id gtgt 23

Monday April 23 12

pros guaranteed unique in 64-bits not much of a

CPU overhead

Monday April 23 12

cons large IDs from the get-go

Monday April 23 12

hundreds of millions of IDs generated with this

scheme no issues

Monday April 23 12

well what about ldquore-shardingrdquo

Monday April 23 12

first recourse pg_reorg

Monday April 23 12

rewrites tables in index order

Monday April 23 12

only requires brief locks for atomic table renames

Monday April 23 12

20+GB savings on some of our dbs

Monday April 23 12

Monday April 23 12

especially useful on EC2

Monday April 23 12

but sometimes you just have to reshard

Monday April 23 12

streaming replication to the rescue

Monday April 23 12

(btw repmgr is awesome)

Monday April 23 12

repmgr standby clone ltmastergt

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineArsquo shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineC shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

Monday April 23 12

PGBouncer abstracts moving DBs from the

app logic

Monday April 23 12

can do this as long as you have more logical shards than physical

ones

Monday April 23 12

beauty of schemas is that they are physically

different files

Monday April 23 12

(no IO hit when deleting no lsquoswiss cheesersquo)

Monday April 23 12

downside requires ~30 seconds of maintenance to roll out new schema

mapping

Monday April 23 12

(could be solved by having concept of ldquoread-

onlyrdquo mode for some DBs)

Monday April 23 12

not great for range-scans that would span across

shards

Monday April 23 12

latest project follow graph

Monday April 23 12

v1 simple DB table(source_id target_id

status)

Monday April 23 12

who do I followwho follows me

do I follow Xdoes X follow me

Monday April 23 12

DB was busy so we started storing parallel

version in Redis

Monday April 23 12

follow_all(300 item list)

Monday April 23 12

inconsistency

Monday April 23 12

extra logic

Monday April 23 12

so much extra logic

Monday April 23 12

exposing your support team to the idea of cache invalidation

Monday April 23 12

Monday April 23 12

redesign took a page from twitterrsquos book

Monday April 23 12

PG can handle tens of thousands of requests very light memcached

caching

Monday April 23 12

next steps

Monday April 23 12

isolating services to minimize open conns

Monday April 23 12

investigate physical hardware etc to reduce

need to re-shard

Monday April 23 12

Wrap up

Monday April 23 12

you donrsquot need to give up PGrsquos durability amp features to shard

Monday April 23 12

continue to let the DB do what the DB is great at

Monday April 23 12

ldquodonrsquot shard until you have tordquo

Monday April 23 12

(but donrsquot over-estimate how hard it will be either)

Monday April 23 12

scaled within constraints of the cloud

Monday April 23 12

PG success story

Monday April 23 12

(wersquore really excited about 92)

Monday April 23 12

thanks any qs

Monday April 23 12


photos kept growing and growing

and only 68GB of RAM on biggest machine in EC2

so what now?

Phase 2: Vertical Partitioning

django db routers make it pretty easy

def db_for_read(self, model, **hints):
    if model._meta.app_label == 'photos': return 'photodb'
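Spelled out a bit more, a hypothetical router along those lines (class name, app label, and database alias are illustrative; it gets listed in DATABASE_ROUTERS in settings):

class PhotoRouter(object):
    """Send the photos app to its own database; everything else uses the default."""

    def db_for_read(self, model, **hints):
        if model._meta.app_label == 'photos':
            return 'photodb'
        return None

    def db_for_write(self, model, **hints):
        if model._meta.app_label == 'photos':
            return 'photodb'
        return None

    def allow_relation(self, obj1, obj2, **hints):
        # cross-database foreign keys are the part that bites (next slide)
        return obj1._meta.app_label == obj2._meta.app_label

# settings.py (illustrative): DATABASE_ROUTERS = ['path.to.PhotoRouter']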

once you untangle all your foreign key relationships

(all of those user/user_id interchangeable calls bite you now)

plenty of time spent in PGFouine

read slaves (using streaming replicas) where we need to reduce contention

a few months later

photos db > 60GB

precipitated by being on cloud hardware, but likely to have hit limit eventually either way

what now?

horizontal partitioning

Phase 3: sharding

"surely we'll have hired someone experienced before we actually need to shard"

never true about scaling

1. choosing a method
2. adapting the application
3. expanding capacity

evaluated solutions

at the time, none were up to the task of being our primary DB

NoSQL alternatives

Skype's sharding proxy

Range/date-based partitioning

did it in Postgres itself

requirements:

1. low operational & code complexity
2. easy expanding of capacity
3. low performance impact on application

schema-based logical sharding

many, many, many (thousands) of logical shards

that map to fewer physical ones

8 logical shards on 2 machines:

user_id % 8 = logical shard

logical shards -> physical shard map:
0: A, 1: A, 2: A, 3: A, 4: B, 5: B, 6: B, 7: B

8 logical shards on 2 -> 4 machines:

user_id % 8 = logical shard

logical shards -> physical shard map:
0: A, 1: A, 2: C, 3: C, 4: B, 5: B, 6: D, 7: D

iexclschemas

Monday April 23 12

all that lsquopublicrsquo stuff Irsquod been glossing over for 2

years

Monday April 23 12

- database - schema - table - columns

Monday April 23 12

spun up set of machines

Monday April 23 12

using fabric created thousands of schemas

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineB shard4 photos_by_user shard5 photos_by_user shard6 photos_by_user shard7 photos_by_user

Monday April 23 12

(fabric or similar parallel task executor is

essential)

Monday April 23 12

application-side logic

Monday April 23 12

SHARD_TO_DB =

SHARD_TO_DB[0] = 0SHARD_TO_DB[1] = 0SHARD_TO_DB[2] = 0SHARD_TO_DB[3] = 0SHARD_TO_DB[4] = 1SHARD_TO_DB[5] = 1SHARD_TO_DB[6] = 1SHARD_TO_DB[7] = 1

Monday April 23 12

instead of Django ORM wrote really simple db

abstraction layer

Monday April 23 12

selectupdateinsertdelete

Monday April 23 12

select(fields table_name shard_key where_statement where_parameters)

Monday April 23 12

select(fields table_name shard_key where_statement where_parameters)

shard_key num_logical_shards = shard_id

Monday April 23 12

in most cases user_id for us

Monday April 23 12

custom Django test runner to createtear-down sharded DBs

Monday April 23 12

most queries involve visiting handful of shards

over one or two machines

Monday April 23 12

if mapping across shards on single DB UNION

ALL to aggregate

Monday April 23 12

clients to library pass in((shard_key id)

(shard_keyid)) etc

Monday April 23 12

library maps sub-selects to each shard and each

machine

Monday April 23 12

parallel execution (per-machine at least)

Monday April 23 12

-gt Append (cost=00097372 rows=100 width=12) (actual time=0290160035 rows=30 loops=1) -gt Limit (cost=00080624 rows=30 width=12) (actual time=0288159913 rows=14 loops=1) -gt Index Scan Backward using index on table (cost=0001865104 rows=694 width=12) (actual time=0286159885 rows=14 loops=1) -gt Limit (cost=0007115 rows=30 width=12) (actual time=00150018 rows=1 loops=1) -gt Index Scan using index on table (cost=00010199 rows=43 width=12) (actual time=00130014 rows=1 loops=1) (etc)

Monday April 23 12

eventually would be nice to parallelize across

machines

Monday April 23 12

next challenge unique IDs

Monday April 23 12

requirements

Monday April 23 12

1 should be time sortable without requiring

a lookup

Monday April 23 12

2 should be 64-bit

Monday April 23 12

3 low operational complexity

Monday April 23 12

surveyed the options

Monday April 23 12

ticket servers

Monday April 23 12

UUID

Monday April 23 12

twitter snowflake

Monday April 23 12

application-level IDs ala Mongo

Monday April 23 12

hey the db is already pretty good about

incrementing sequences

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

CREATE OR REPLACE FUNCTION insta5next_id(OUT result bigint) AS $$DECLARE our_epoch bigint = 1314220021721 seq_id bigint now_millis bigint shard_id int = 5BEGIN SELECT nextval(insta5table_id_seq) 1024 INTO seq_id

SELECT FLOOR(EXTRACT(EPOCH FROM clock_timestamp()) 1000) INTO now_millis result = (now_millis - our_epoch) ltlt 23 result = result | (shard_id ltlt 10) result = result | (seq_id)END$$ LANGUAGE PLPGSQL

Monday April 23 12

pulling shard ID from ID

shard_id = id ^ ((id gtgt 23) ltlt 23)timestamp = EPOCH + id gtgt 23

Monday April 23 12

pros guaranteed unique in 64-bits not much of a

CPU overhead

Monday April 23 12

cons large IDs from the get-go

Monday April 23 12

hundreds of millions of IDs generated with this

scheme no issues

Monday April 23 12

well what about ldquore-shardingrdquo

Monday April 23 12

first recourse pg_reorg

Monday April 23 12

rewrites tables in index order

Monday April 23 12

only requires brief locks for atomic table renames

Monday April 23 12

20+GB savings on some of our dbs

Monday April 23 12

Monday April 23 12

especially useful on EC2

Monday April 23 12

but sometimes you just have to reshard

Monday April 23 12

streaming replication to the rescue

Monday April 23 12

(btw repmgr is awesome)

Monday April 23 12

repmgr standby clone ltmastergt

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineArsquo shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineC shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

Monday April 23 12

PGBouncer abstracts moving DBs from the

app logic

Monday April 23 12

can do this as long as you have more logical shards than physical

ones

Monday April 23 12

beauty of schemas is that they are physically

different files

Monday April 23 12

(no IO hit when deleting no lsquoswiss cheesersquo)

Monday April 23 12

downside requires ~30 seconds of maintenance to roll out new schema

mapping

Monday April 23 12

(could be solved by having concept of ldquoread-

onlyrdquo mode for some DBs)

Monday April 23 12

not great for range-scans that would span across

shards

Monday April 23 12

latest project follow graph

Monday April 23 12

v1 simple DB table(source_id target_id

status)

Monday April 23 12

who do I followwho follows me

do I follow Xdoes X follow me

Monday April 23 12

DB was busy so we started storing parallel

version in Redis

Monday April 23 12

follow_all(300 item list)

Monday April 23 12

inconsistency

Monday April 23 12

extra logic

Monday April 23 12

so much extra logic

Monday April 23 12

exposing your support team to the idea of cache invalidation

Monday April 23 12

Monday April 23 12

redesign took a page from twitterrsquos book

Monday April 23 12

PG can handle tens of thousands of requests very light memcached

caching

Monday April 23 12

next steps

Monday April 23 12

isolating services to minimize open conns

Monday April 23 12

investigate physical hardware etc to reduce

need to re-shard

Monday April 23 12

Wrap up

Monday April 23 12

you donrsquot need to give up PGrsquos durability amp features to shard

Monday April 23 12

continue to let the DB do what the DB is great at

Monday April 23 12

ldquodonrsquot shard until you have tordquo

Monday April 23 12

(but donrsquot over-estimate how hard it will be either)

Monday April 23 12

scaled within constraints of the cloud

Monday April 23 12

PG success story

Monday April 23 12

(wersquore really excited about 92)

Monday April 23 12

thanks any qs

Monday April 23 12

def db_for_read(self model) if app_label == photos return photodb

Monday April 23 12

once you untangle all your foreign key

relationships

Monday April 23 12

(all of those useruser_id interchangeable calls bite

you now)

Monday April 23 12

plenty of time spent in PGFouine

Monday April 23 12

read slaves (using streaming replicas) where

we need to reduce contention

Monday April 23 12

a few months later

Monday April 23 12

photosdb gt 60GB

Monday April 23 12

precipitated by being on cloud hardware but likely to have hit limit eventually

either way

Monday April 23 12

what now

Monday April 23 12

horizontal partitioning

Monday April 23 12

Phase 3 sharding

Monday April 23 12

ldquosurely wersquoll have hired someone experienced before we actually need

to shardrdquo

Monday April 23 12

never true about scaling

Monday April 23 12

1 choosing a method2 adapting the application

3 expanding capacity

Monday April 23 12

evaluated solutions

Monday April 23 12

at the time none were up to task of being our

primary DB

Monday April 23 12

NoSQL alternatives

Monday April 23 12

Skypersquos sharding proxy

Monday April 23 12

Rangedate-based partitioning

Monday April 23 12

did in Postgres itself

Monday April 23 12

requirements

Monday April 23 12

1 low operational amp code complexity

Monday April 23 12

2 easy expanding of capacity

Monday April 23 12

3 low performance impact on application

Monday April 23 12

schema-based logical sharding

Monday April 23 12

many many many (thousands) of logical

shards

Monday April 23 12

that map to fewer physical ones

Monday April 23 12

8 logical shards on 2 machines

user_id 8 = logical shard

logical shards -gt physical shard map

0 A 1 A 2 A 3 A 4 B 5 B 6 B 7 B

Monday April 23 12

8 logical shards on 2 4 machines

user_id 8 = logical shard

logical shards -gt physical shard map

0 A 1 A 2 C 3 C 4 B 5 B 6 D 7 D

Monday April 23 12

iexclschemas

Monday April 23 12

all that lsquopublicrsquo stuff Irsquod been glossing over for 2

years

Monday April 23 12

- database - schema - table - columns

Monday April 23 12

spun up set of machines

Monday April 23 12

using fabric created thousands of schemas

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineB shard4 photos_by_user shard5 photos_by_user shard6 photos_by_user shard7 photos_by_user

Monday April 23 12

(fabric or similar parallel task executor is

essential)

Monday April 23 12

application-side logic

Monday April 23 12

SHARD_TO_DB =

SHARD_TO_DB[0] = 0SHARD_TO_DB[1] = 0SHARD_TO_DB[2] = 0SHARD_TO_DB[3] = 0SHARD_TO_DB[4] = 1SHARD_TO_DB[5] = 1SHARD_TO_DB[6] = 1SHARD_TO_DB[7] = 1

Monday April 23 12

instead of Django ORM wrote really simple db

abstraction layer

Monday April 23 12

selectupdateinsertdelete

Monday April 23 12

select(fields table_name shard_key where_statement where_parameters)

Monday April 23 12

select(fields table_name shard_key where_statement where_parameters)

shard_key num_logical_shards = shard_id

Monday April 23 12

in most cases user_id for us

Monday April 23 12

custom Django test runner to createtear-down sharded DBs

Monday April 23 12

most queries involve visiting handful of shards

over one or two machines

Monday April 23 12

if mapping across shards on single DB UNION

ALL to aggregate

Monday April 23 12

clients to library pass in((shard_key id)

(shard_keyid)) etc

Monday April 23 12

library maps sub-selects to each shard and each

machine

Monday April 23 12

parallel execution (per-machine at least)

Monday April 23 12

-gt Append (cost=00097372 rows=100 width=12) (actual time=0290160035 rows=30 loops=1) -gt Limit (cost=00080624 rows=30 width=12) (actual time=0288159913 rows=14 loops=1) -gt Index Scan Backward using index on table (cost=0001865104 rows=694 width=12) (actual time=0286159885 rows=14 loops=1) -gt Limit (cost=0007115 rows=30 width=12) (actual time=00150018 rows=1 loops=1) -gt Index Scan using index on table (cost=00010199 rows=43 width=12) (actual time=00130014 rows=1 loops=1) (etc)

Monday April 23 12

eventually would be nice to parallelize across

machines

Monday April 23 12

next challenge unique IDs

Monday April 23 12

requirements

Monday April 23 12

1 should be time sortable without requiring

a lookup

Monday April 23 12

2 should be 64-bit

Monday April 23 12

3 low operational complexity

Monday April 23 12

surveyed the options

Monday April 23 12

ticket servers

Monday April 23 12

UUID

Monday April 23 12

twitter snowflake

Monday April 23 12

application-level IDs à la Mongo

Monday April 23 12

hey the db is already pretty good about

incrementing sequences

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12


CREATE OR REPLACE FUNCTION insta5.next_id(OUT result bigint) AS $$
DECLARE
    our_epoch bigint := 1314220021721;
    seq_id bigint;
    now_millis bigint;
    shard_id int := 5;
BEGIN
    SELECT nextval('insta5.table_id_seq') % 1024 INTO seq_id;
    SELECT FLOOR(EXTRACT(EPOCH FROM clock_timestamp()) * 1000) INTO now_millis;
    result := (now_millis - our_epoch) << 23;
    result := result | (shard_id << 10);
    result := result | (seq_id);
END;
$$ LANGUAGE PLPGSQL;

Monday April 23 12
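(presumably each schema gets its own copy of this function with its shard_id baked in, and each sharded table defaults its id to it — a hedged example, with the table and columns invented; cur is a psycopg2 cursor on that shard's machine as in the earlier sketch)

cur.execute("""
    CREATE TABLE insta5.our_table (
        id bigint NOT NULL DEFAULT insta5.next_id(),
        user_id bigint NOT NULL,
        created_at timestamp NOT NULL DEFAULT now())""")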

pulling shard ID from an ID:

shard_id = id ^ ((id >> 23) << 23)
timestamp = EPOCH + (id >> 23)

Monday April 23 12
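(the same arithmetic as a small Python cross-check — the slide's one-liner keeps the low 23 bits together, here they're split per the 41/13/10 layout above; EPOCH is the custom epoch from next_id())

EPOCH = 1314220021721  # millis, same as our_epoch above

def decode_id(the_id):
    millis_since_epoch = the_id >> 23      # 41 bits of time
    shard_id = (the_id >> 10) & 0x1FFF     # 13 bits of shard ID
    seq_id = the_id & 0x3FF                # 10 bits of sequence
    return EPOCH + millis_since_epoch, shard_id, seq_id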

pros: guaranteed unique in 64 bits, not much of a CPU overhead

Monday April 23 12

cons: large IDs from the get-go

Monday April 23 12

hundreds of millions of IDs generated with this scheme, no issues

Monday April 23 12

well, what about "re-sharding"?

Monday April 23 12

first recourse: pg_reorg

Monday April 23 12

rewrites tables in index order

Monday April 23 12

only requires brief locks for atomic table renames

Monday April 23 12

20+GB savings on some of our dbs

Monday April 23 12


especially useful on EC2

Monday April 23 12

but sometimes you just have to reshard

Monday April 23 12

streaming replication to the rescue

Monday April 23 12

(btw repmgr is awesome)

Monday April 23 12

repmgr standby clone <master>

Monday April 23 12

machineA:
  shard0.photos_by_user
  shard1.photos_by_user
  shard2.photos_by_user
  shard3.photos_by_user

machineA':
  shard0.photos_by_user
  shard1.photos_by_user
  shard2.photos_by_user
  shard3.photos_by_user

Monday April 23 12

machineA:
  shard0.photos_by_user
  shard1.photos_by_user
  shard2.photos_by_user
  shard3.photos_by_user

machineC:
  shard0.photos_by_user
  shard1.photos_by_user
  shard2.photos_by_user
  shard3.photos_by_user

Monday April 23 12

PGBouncer abstracts moving DBs from the

app logic

Monday April 23 12

can do this as long as you have more logical shards than physical

ones

Monday April 23 12

beauty of schemas is that they are physically

different files

Monday April 23 12

(no I/O hit when deleting, no 'swiss cheese')

Monday April 23 12

downside: requires ~30 seconds of maintenance to roll out new schema mapping

Monday April 23 12

(could be solved by having concept of "read-only" mode for some DBs)

Monday April 23 12

not great for range-scans that would span across

shards

Monday April 23 12

latest project: follow graph

Monday April 23 12

v1: simple DB table (source_id, target_id, status)

Monday April 23 12

who do I follow? who follows me?
do I follow X? does X follow me?

Monday April 23 12
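(for illustration, those four lookups against the v1 table through the hypothetical select() helper sketched earlier — the table name, 'active' status value, and example ids my_id/x_id are invented, and sharding by source vs. target user is glossed over)

# who do I follow?
select(['target_id'], 'follows', my_id, 'source_id = %s AND status = %s', [my_id, 'active'])
# who follows me?
select(['source_id'], 'follows', my_id, 'target_id = %s AND status = %s', [my_id, 'active'])
# do I follow X?
select(['status'], 'follows', my_id, 'source_id = %s AND target_id = %s', [my_id, x_id])
# does X follow me?
select(['status'], 'follows', x_id, 'source_id = %s AND target_id = %s', [x_id, my_id])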

DB was busy, so we started storing a parallel version in Redis

Monday April 23 12

follow_all(300 item list)

Monday April 23 12

inconsistency

Monday April 23 12

extra logic

Monday April 23 12

so much extra logic

Monday April 23 12

exposing your support team to the idea of cache invalidation

Monday April 23 12


redesign took a page from twitter's book

Monday April 23 12

PG can handle tens of thousands of requests, very light memcached caching

Monday April 23 12

next steps

Monday April 23 12

isolating services to minimize open conns

Monday April 23 12

investigate physical hardware etc to reduce

need to re-shard

Monday April 23 12

Wrap up

Monday April 23 12

you don't need to give up PG's durability & features to shard

Monday April 23 12

continue to let the DB do what the DB is great at

Monday April 23 12

"don't shard until you have to"

Monday April 23 12

(but don't over-estimate how hard it will be either)

Monday April 23 12

scaled within constraints of the cloud

Monday April 23 12

PG success story

Monday April 23 12

(we're really excited about 9.2)

Monday April 23 12

thanks! any q's?

Monday April 23 12

once you untangle all your foreign key

relationships

Monday April 23 12

(all of those user/user_id interchangeable calls bite you now)

Monday April 23 12

plenty of time spent in PGFouine

Monday April 23 12

read slaves (using streaming replicas) where

we need to reduce contention

Monday April 23 12

a few months later

Monday April 23 12

photosdb > 60GB

Monday April 23 12

precipitated by being on cloud hardware, but likely to have hit limit eventually either way

Monday April 23 12

what now?

Monday April 23 12

horizontal partitioning

Monday April 23 12

Phase 3 sharding

Monday April 23 12

"surely we'll have hired someone experienced before we actually need to shard"

Monday April 23 12

never true about scaling

Monday April 23 12

1. choosing a method
2. adapting the application
3. expanding capacity

Monday April 23 12

evaluated solutions

Monday April 23 12


Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

CREATE OR REPLACE FUNCTION insta5next_id(OUT result bigint) AS $$DECLARE our_epoch bigint = 1314220021721 seq_id bigint now_millis bigint shard_id int = 5BEGIN SELECT nextval(insta5table_id_seq) 1024 INTO seq_id

SELECT FLOOR(EXTRACT(EPOCH FROM clock_timestamp()) 1000) INTO now_millis result = (now_millis - our_epoch) ltlt 23 result = result | (shard_id ltlt 10) result = result | (seq_id)END$$ LANGUAGE PLPGSQL

Monday April 23 12

pulling shard ID from ID

shard_id = id ^ ((id gtgt 23) ltlt 23)timestamp = EPOCH + id gtgt 23

Monday April 23 12

pros guaranteed unique in 64-bits not much of a

CPU overhead

Monday April 23 12

cons large IDs from the get-go

Monday April 23 12

hundreds of millions of IDs generated with this

scheme no issues

Monday April 23 12

well what about ldquore-shardingrdquo

Monday April 23 12

first recourse pg_reorg

Monday April 23 12

rewrites tables in index order

Monday April 23 12

only requires brief locks for atomic table renames

Monday April 23 12

20+GB savings on some of our dbs

Monday April 23 12

Monday April 23 12

especially useful on EC2

Monday April 23 12

but sometimes you just have to reshard

Monday April 23 12

streaming replication to the rescue

Monday April 23 12

(btw repmgr is awesome)

Monday April 23 12

repmgr standby clone ltmastergt

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineArsquo shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineC shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

Monday April 23 12

PGBouncer abstracts moving DBs from the

app logic

Monday April 23 12

can do this as long as you have more logical shards than physical

ones

Monday April 23 12

beauty of schemas is that they are physically

different files

Monday April 23 12

(no IO hit when deleting no lsquoswiss cheesersquo)

Monday April 23 12

downside requires ~30 seconds of maintenance to roll out new schema

mapping

Monday April 23 12

(could be solved by having concept of ldquoread-

onlyrdquo mode for some DBs)

Monday April 23 12

not great for range-scans that would span across

shards

Monday April 23 12

latest project follow graph

Monday April 23 12

v1 simple DB table(source_id target_id

status)

Monday April 23 12

who do I followwho follows me

do I follow Xdoes X follow me

Monday April 23 12

DB was busy so we started storing parallel

version in Redis

Monday April 23 12

follow_all(300 item list)

Monday April 23 12

inconsistency

Monday April 23 12

extra logic

Monday April 23 12

so much extra logic

Monday April 23 12

exposing your support team to the idea of cache invalidation

Monday April 23 12

Monday April 23 12

redesign took a page from twitterrsquos book

Monday April 23 12

PG can handle tens of thousands of requests very light memcached

caching

Monday April 23 12

next steps

Monday April 23 12

isolating services to minimize open conns

Monday April 23 12

investigate physical hardware etc to reduce

need to re-shard

Monday April 23 12

Wrap up

Monday April 23 12

you donrsquot need to give up PGrsquos durability amp features to shard

Monday April 23 12

continue to let the DB do what the DB is great at

Monday April 23 12

ldquodonrsquot shard until you have tordquo

Monday April 23 12

(but donrsquot over-estimate how hard it will be either)

Monday April 23 12

scaled within constraints of the cloud

Monday April 23 12

PG success story

Monday April 23 12

(wersquore really excited about 92)

Monday April 23 12

thanks any qs

Monday April 23 12

what now

Monday April 23 12

horizontal partitioning

Monday April 23 12

Phase 3 sharding

Monday April 23 12

ldquosurely wersquoll have hired someone experienced before we actually need

to shardrdquo

Monday April 23 12

never true about scaling

Monday April 23 12

1 choosing a method2 adapting the application

3 expanding capacity

Monday April 23 12

evaluated solutions

Monday April 23 12

at the time none were up to task of being our

primary DB

Monday April 23 12

NoSQL alternatives

Monday April 23 12

Skypersquos sharding proxy

Monday April 23 12

Rangedate-based partitioning

Monday April 23 12

did in Postgres itself

Monday April 23 12

requirements

Monday April 23 12

1 low operational amp code complexity

Monday April 23 12

2 easy expanding of capacity

Monday April 23 12

3 low performance impact on application

Monday April 23 12

schema-based logical sharding

Monday April 23 12

many many many (thousands) of logical

shards

Monday April 23 12

that map to fewer physical ones

Monday April 23 12

8 logical shards on 2 machines

user_id 8 = logical shard

logical shards -gt physical shard map

0 A 1 A 2 A 3 A 4 B 5 B 6 B 7 B

Monday April 23 12

8 logical shards on 2 4 machines

user_id 8 = logical shard

logical shards -gt physical shard map

0 A 1 A 2 C 3 C 4 B 5 B 6 D 7 D

Monday April 23 12

iexclschemas

Monday April 23 12

all that lsquopublicrsquo stuff Irsquod been glossing over for 2

years

Monday April 23 12

- database - schema - table - columns

Monday April 23 12

spun up set of machines

Monday April 23 12

using fabric created thousands of schemas

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineB shard4 photos_by_user shard5 photos_by_user shard6 photos_by_user shard7 photos_by_user

Monday April 23 12

(fabric or similar parallel task executor is

essential)

Monday April 23 12

application-side logic

Monday April 23 12

SHARD_TO_DB =

SHARD_TO_DB[0] = 0SHARD_TO_DB[1] = 0SHARD_TO_DB[2] = 0SHARD_TO_DB[3] = 0SHARD_TO_DB[4] = 1SHARD_TO_DB[5] = 1SHARD_TO_DB[6] = 1SHARD_TO_DB[7] = 1

Monday April 23 12

instead of Django ORM wrote really simple db

abstraction layer

Monday April 23 12

selectupdateinsertdelete

Monday April 23 12

select(fields table_name shard_key where_statement where_parameters)

Monday April 23 12

select(fields table_name shard_key where_statement where_parameters)

shard_key num_logical_shards = shard_id

Monday April 23 12

in most cases user_id for us

Monday April 23 12

custom Django test runner to createtear-down sharded DBs

Monday April 23 12

most queries involve visiting handful of shards

over one or two machines

Monday April 23 12

if mapping across shards on single DB UNION

ALL to aggregate

Monday April 23 12

clients to library pass in((shard_key id)

(shard_keyid)) etc

Monday April 23 12

library maps sub-selects to each shard and each

machine

Monday April 23 12

parallel execution (per-machine at least)

Monday April 23 12

-gt Append (cost=00097372 rows=100 width=12) (actual time=0290160035 rows=30 loops=1) -gt Limit (cost=00080624 rows=30 width=12) (actual time=0288159913 rows=14 loops=1) -gt Index Scan Backward using index on table (cost=0001865104 rows=694 width=12) (actual time=0286159885 rows=14 loops=1) -gt Limit (cost=0007115 rows=30 width=12) (actual time=00150018 rows=1 loops=1) -gt Index Scan using index on table (cost=00010199 rows=43 width=12) (actual time=00130014 rows=1 loops=1) (etc)

Monday April 23 12

eventually would be nice to parallelize across

machines

Monday April 23 12

next challenge unique IDs

Monday April 23 12

requirements

Monday April 23 12

1 should be time sortable without requiring

a lookup

Monday April 23 12

2 should be 64-bit

Monday April 23 12

3 low operational complexity

Monday April 23 12

surveyed the options

Monday April 23 12

ticket servers

Monday April 23 12

UUID

Monday April 23 12

twitter snowflake

Monday April 23 12

application-level IDs ala Mongo

Monday April 23 12

hey the db is already pretty good about

incrementing sequences

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

CREATE OR REPLACE FUNCTION insta5next_id(OUT result bigint) AS $$DECLARE our_epoch bigint = 1314220021721 seq_id bigint now_millis bigint shard_id int = 5BEGIN SELECT nextval(insta5table_id_seq) 1024 INTO seq_id

SELECT FLOOR(EXTRACT(EPOCH FROM clock_timestamp()) 1000) INTO now_millis result = (now_millis - our_epoch) ltlt 23 result = result | (shard_id ltlt 10) result = result | (seq_id)END$$ LANGUAGE PLPGSQL

Monday April 23 12

pulling shard ID from ID

shard_id = id ^ ((id gtgt 23) ltlt 23)timestamp = EPOCH + id gtgt 23

Monday April 23 12

pros guaranteed unique in 64-bits not much of a

CPU overhead

Monday April 23 12

cons large IDs from the get-go

Monday April 23 12

hundreds of millions of IDs generated with this

scheme no issues

Monday April 23 12

well what about ldquore-shardingrdquo

Monday April 23 12

first recourse pg_reorg

Monday April 23 12

rewrites tables in index order

Monday April 23 12

only requires brief locks for atomic table renames

Monday April 23 12

20+GB savings on some of our dbs

Monday April 23 12

Monday April 23 12

especially useful on EC2

Monday April 23 12

but sometimes you just have to reshard

Monday April 23 12

streaming replication to the rescue

Monday April 23 12

(btw repmgr is awesome)

Monday April 23 12

repmgr standby clone ltmastergt

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineArsquo shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineC shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

Monday April 23 12

PGBouncer abstracts moving DBs from the

app logic

Monday April 23 12

can do this as long as you have more logical shards than physical

ones

Monday April 23 12

beauty of schemas is that they are physically

different files

Monday April 23 12

(no IO hit when deleting no lsquoswiss cheesersquo)

Monday April 23 12

downside requires ~30 seconds of maintenance to roll out new schema

mapping

Monday April 23 12

(could be solved by having concept of ldquoread-

onlyrdquo mode for some DBs)

Monday April 23 12

not great for range-scans that would span across

shards

Monday April 23 12

latest project follow graph

Monday April 23 12

v1 simple DB table(source_id target_id

status)

Monday April 23 12

who do I followwho follows me

do I follow Xdoes X follow me

Monday April 23 12

DB was busy so we started storing parallel

version in Redis

Monday April 23 12

follow_all(300 item list)

Monday April 23 12

inconsistency

Monday April 23 12

extra logic

Monday April 23 12

so much extra logic

Monday April 23 12

exposing your support team to the idea of cache invalidation

Monday April 23 12

Monday April 23 12

redesign took a page from twitterrsquos book

Monday April 23 12

PG can handle tens of thousands of requests very light memcached

caching

Monday April 23 12

next steps

Monday April 23 12

isolating services to minimize open conns

Monday April 23 12

investigate physical hardware etc to reduce

need to re-shard

Monday April 23 12

Wrap up

Monday April 23 12

you donrsquot need to give up PGrsquos durability amp features to shard

Monday April 23 12

continue to let the DB do what the DB is great at

Monday April 23 12

ldquodonrsquot shard until you have tordquo

Monday April 23 12

(but donrsquot over-estimate how hard it will be either)

Monday April 23 12

scaled within constraints of the cloud

Monday April 23 12

PG success story

Monday April 23 12

(wersquore really excited about 92)

Monday April 23 12

thanks any qs

Monday April 23 12

horizontal partitioning

Monday April 23 12

Phase 3 sharding

Monday April 23 12

ldquosurely wersquoll have hired someone experienced before we actually need

to shardrdquo

Monday April 23 12

never true about scaling

Monday April 23 12

1 choosing a method2 adapting the application

3 expanding capacity

Monday April 23 12

evaluated solutions

Monday April 23 12

at the time none were up to task of being our

primary DB

Monday April 23 12

NoSQL alternatives

Monday April 23 12

Skypersquos sharding proxy

Monday April 23 12

Rangedate-based partitioning

Monday April 23 12

did in Postgres itself

Monday April 23 12

requirements

Monday April 23 12

1 low operational amp code complexity

Monday April 23 12

2 easy expanding of capacity

Monday April 23 12

3 low performance impact on application

Monday April 23 12

schema-based logical sharding

Monday April 23 12

many many many (thousands) of logical

shards

Monday April 23 12

that map to fewer physical ones

Monday April 23 12

8 logical shards on 2 machines

user_id 8 = logical shard

logical shards -gt physical shard map

0 A 1 A 2 A 3 A 4 B 5 B 6 B 7 B

Monday April 23 12

8 logical shards on 2 4 machines

user_id 8 = logical shard

logical shards -gt physical shard map

0 A 1 A 2 C 3 C 4 B 5 B 6 D 7 D

Monday April 23 12

iexclschemas

Monday April 23 12

all that lsquopublicrsquo stuff Irsquod been glossing over for 2

years

Monday April 23 12

- database - schema - table - columns

Monday April 23 12

spun up set of machines

Monday April 23 12

using fabric created thousands of schemas

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineB shard4 photos_by_user shard5 photos_by_user shard6 photos_by_user shard7 photos_by_user

Monday April 23 12

(fabric or similar parallel task executor is

essential)

Monday April 23 12

application-side logic

Monday April 23 12

SHARD_TO_DB =

SHARD_TO_DB[0] = 0SHARD_TO_DB[1] = 0SHARD_TO_DB[2] = 0SHARD_TO_DB[3] = 0SHARD_TO_DB[4] = 1SHARD_TO_DB[5] = 1SHARD_TO_DB[6] = 1SHARD_TO_DB[7] = 1

Monday April 23 12

instead of Django ORM wrote really simple db

abstraction layer

Monday April 23 12

selectupdateinsertdelete

Monday April 23 12

select(fields table_name shard_key where_statement where_parameters)

Monday April 23 12

select(fields table_name shard_key where_statement where_parameters)

shard_key num_logical_shards = shard_id

Monday April 23 12

in most cases user_id for us

Monday April 23 12

custom Django test runner to createtear-down sharded DBs

Monday April 23 12

most queries involve visiting handful of shards

over one or two machines

Monday April 23 12

if mapping across shards on single DB UNION

ALL to aggregate

Monday April 23 12

clients to library pass in((shard_key id)

(shard_keyid)) etc

Monday April 23 12

library maps sub-selects to each shard and each

machine

Monday April 23 12

parallel execution (per-machine at least)

Monday April 23 12

-gt Append (cost=00097372 rows=100 width=12) (actual time=0290160035 rows=30 loops=1) -gt Limit (cost=00080624 rows=30 width=12) (actual time=0288159913 rows=14 loops=1) -gt Index Scan Backward using index on table (cost=0001865104 rows=694 width=12) (actual time=0286159885 rows=14 loops=1) -gt Limit (cost=0007115 rows=30 width=12) (actual time=00150018 rows=1 loops=1) -gt Index Scan using index on table (cost=00010199 rows=43 width=12) (actual time=00130014 rows=1 loops=1) (etc)

Monday April 23 12

eventually would be nice to parallelize across

machines

Monday April 23 12

next challenge unique IDs

Monday April 23 12

requirements

Monday April 23 12

1 should be time sortable without requiring

a lookup

Monday April 23 12

2 should be 64-bit

Monday April 23 12

3 low operational complexity

Monday April 23 12

surveyed the options

Monday April 23 12

ticket servers

Monday April 23 12

UUID

Monday April 23 12

twitter snowflake

Monday April 23 12

application-level IDs ala Mongo

Monday April 23 12

hey the db is already pretty good about

incrementing sequences

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

CREATE OR REPLACE FUNCTION insta5next_id(OUT result bigint) AS $$DECLARE our_epoch bigint = 1314220021721 seq_id bigint now_millis bigint shard_id int = 5BEGIN SELECT nextval(insta5table_id_seq) 1024 INTO seq_id

SELECT FLOOR(EXTRACT(EPOCH FROM clock_timestamp()) 1000) INTO now_millis result = (now_millis - our_epoch) ltlt 23 result = result | (shard_id ltlt 10) result = result | (seq_id)END$$ LANGUAGE PLPGSQL

Monday April 23 12

pulling shard ID from ID

shard_id = id ^ ((id gtgt 23) ltlt 23)timestamp = EPOCH + id gtgt 23

Monday April 23 12

pros guaranteed unique in 64-bits not much of a

CPU overhead

Monday April 23 12

cons large IDs from the get-go

Monday April 23 12

hundreds of millions of IDs generated with this

scheme no issues

Monday April 23 12

well what about ldquore-shardingrdquo

Monday April 23 12

first recourse pg_reorg

Monday April 23 12

rewrites tables in index order

Monday April 23 12

only requires brief locks for atomic table renames

Monday April 23 12

20+GB savings on some of our dbs

Monday April 23 12

Monday April 23 12

especially useful on EC2

Monday April 23 12

but sometimes you just have to reshard

Monday April 23 12

streaming replication to the rescue

Monday April 23 12

(btw repmgr is awesome)

Monday April 23 12

repmgr standby clone ltmastergt

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineArsquo shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineC shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

Monday April 23 12

PGBouncer abstracts moving DBs from the

app logic

Monday April 23 12

can do this as long as you have more logical shards than physical

ones

Monday April 23 12

beauty of schemas is that they are physically

different files

Monday April 23 12

(no IO hit when deleting no lsquoswiss cheesersquo)

Monday April 23 12

downside requires ~30 seconds of maintenance to roll out new schema

mapping

Monday April 23 12

(could be solved by having concept of ldquoread-

onlyrdquo mode for some DBs)

Monday April 23 12

not great for range-scans that would span across

shards

Monday April 23 12

latest project follow graph

Monday April 23 12

v1 simple DB table(source_id target_id

status)

Monday April 23 12

who do I followwho follows me

do I follow Xdoes X follow me

Monday April 23 12

DB was busy so we started storing parallel

version in Redis

Monday April 23 12

follow_all(300 item list)

Monday April 23 12

inconsistency

Monday April 23 12

extra logic

Monday April 23 12

so much extra logic

Monday April 23 12

exposing your support team to the idea of cache invalidation

Monday April 23 12

Monday April 23 12

redesign took a page from twitterrsquos book

Monday April 23 12

PG can handle tens of thousands of requests very light memcached

caching

Monday April 23 12

next steps

Monday April 23 12

isolating services to minimize open conns

Monday April 23 12

investigate physical hardware etc to reduce

need to re-shard

Monday April 23 12

Wrap up

Monday April 23 12

you donrsquot need to give up PGrsquos durability amp features to shard

Monday April 23 12

continue to let the DB do what the DB is great at

Monday April 23 12

ldquodonrsquot shard until you have tordquo

Monday April 23 12

(but donrsquot over-estimate how hard it will be either)

Monday April 23 12

scaled within constraints of the cloud

Monday April 23 12

PG success story

Monday April 23 12

(wersquore really excited about 92)

Monday April 23 12

thanks any qs

Monday April 23 12

Phase 3 sharding

Monday April 23 12

ldquosurely wersquoll have hired someone experienced before we actually need

to shardrdquo

Monday April 23 12

never true about scaling

Monday April 23 12

1 choosing a method2 adapting the application

3 expanding capacity

Monday April 23 12

evaluated solutions

Monday April 23 12

at the time none were up to task of being our

primary DB

Monday April 23 12

NoSQL alternatives

Monday April 23 12

Skypersquos sharding proxy

Monday April 23 12

Rangedate-based partitioning

Monday April 23 12

did in Postgres itself

Monday April 23 12

requirements

Monday April 23 12

1 low operational amp code complexity

Monday April 23 12

2 easy expanding of capacity

Monday April 23 12

3 low performance impact on application

Monday April 23 12

schema-based logical sharding

Monday April 23 12

many many many (thousands) of logical

shards

Monday April 23 12

that map to fewer physical ones

Monday April 23 12

8 logical shards on 2 machines

user_id 8 = logical shard

logical shards -gt physical shard map

0 A 1 A 2 A 3 A 4 B 5 B 6 B 7 B

Monday April 23 12

8 logical shards on 2 4 machines

user_id 8 = logical shard

logical shards -gt physical shard map

0 A 1 A 2 C 3 C 4 B 5 B 6 D 7 D

Monday April 23 12

iexclschemas

Monday April 23 12

all that lsquopublicrsquo stuff Irsquod been glossing over for 2

years

Monday April 23 12

- database - schema - table - columns

Monday April 23 12

spun up set of machines

Monday April 23 12

using fabric created thousands of schemas

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineB shard4 photos_by_user shard5 photos_by_user shard6 photos_by_user shard7 photos_by_user

Monday April 23 12

(fabric or similar parallel task executor is

essential)

Monday April 23 12

application-side logic

Monday April 23 12

SHARD_TO_DB =

SHARD_TO_DB[0] = 0SHARD_TO_DB[1] = 0SHARD_TO_DB[2] = 0SHARD_TO_DB[3] = 0SHARD_TO_DB[4] = 1SHARD_TO_DB[5] = 1SHARD_TO_DB[6] = 1SHARD_TO_DB[7] = 1

Monday April 23 12

instead of Django ORM wrote really simple db

abstraction layer

Monday April 23 12

selectupdateinsertdelete

Monday April 23 12

select(fields table_name shard_key where_statement where_parameters)

Monday April 23 12

select(fields table_name shard_key where_statement where_parameters)

shard_key num_logical_shards = shard_id

Monday April 23 12

in most cases user_id for us

Monday April 23 12

custom Django test runner to createtear-down sharded DBs

Monday April 23 12

most queries involve visiting handful of shards

over one or two machines

Monday April 23 12

if mapping across shards on single DB UNION

ALL to aggregate

Monday April 23 12

clients to library pass in((shard_key id)

(shard_keyid)) etc

Monday April 23 12

library maps sub-selects to each shard and each

machine

Monday April 23 12

parallel execution (per-machine at least)

Monday April 23 12

-gt Append (cost=00097372 rows=100 width=12) (actual time=0290160035 rows=30 loops=1) -gt Limit (cost=00080624 rows=30 width=12) (actual time=0288159913 rows=14 loops=1) -gt Index Scan Backward using index on table (cost=0001865104 rows=694 width=12) (actual time=0286159885 rows=14 loops=1) -gt Limit (cost=0007115 rows=30 width=12) (actual time=00150018 rows=1 loops=1) -gt Index Scan using index on table (cost=00010199 rows=43 width=12) (actual time=00130014 rows=1 loops=1) (etc)

Monday April 23 12

eventually would be nice to parallelize across

machines

Monday April 23 12

next challenge unique IDs

Monday April 23 12

requirements

Monday April 23 12

1 should be time sortable without requiring

a lookup

Monday April 23 12

2 should be 64-bit

Monday April 23 12

3 low operational complexity

Monday April 23 12

surveyed the options

Monday April 23 12

ticket servers

Monday April 23 12

UUID

Monday April 23 12

twitter snowflake

Monday April 23 12

application-level IDs ala Mongo

Monday April 23 12

hey the db is already pretty good about

incrementing sequences

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

CREATE OR REPLACE FUNCTION insta5next_id(OUT result bigint) AS $$DECLARE our_epoch bigint = 1314220021721 seq_id bigint now_millis bigint shard_id int = 5BEGIN SELECT nextval(insta5table_id_seq) 1024 INTO seq_id

SELECT FLOOR(EXTRACT(EPOCH FROM clock_timestamp()) 1000) INTO now_millis result = (now_millis - our_epoch) ltlt 23 result = result | (shard_id ltlt 10) result = result | (seq_id)END$$ LANGUAGE PLPGSQL

Monday April 23 12

pulling shard ID from ID

shard_id = id ^ ((id gtgt 23) ltlt 23)timestamp = EPOCH + id gtgt 23

Monday April 23 12

pros guaranteed unique in 64-bits not much of a

CPU overhead

Monday April 23 12

cons large IDs from the get-go

Monday April 23 12

hundreds of millions of IDs generated with this

scheme no issues

Monday April 23 12

well what about ldquore-shardingrdquo

Monday April 23 12

first recourse pg_reorg

Monday April 23 12

rewrites tables in index order

Monday April 23 12

only requires brief locks for atomic table renames

Monday April 23 12

20+GB savings on some of our dbs

Monday April 23 12

Monday April 23 12

especially useful on EC2

Monday April 23 12

but sometimes you just have to reshard

Monday April 23 12

streaming replication to the rescue

Monday April 23 12

(btw repmgr is awesome)

Monday April 23 12

repmgr standby clone ltmastergt

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineArsquo shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineC shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

Monday April 23 12

PGBouncer abstracts moving DBs from the

app logic

Monday April 23 12

can do this as long as you have more logical shards than physical

ones

Monday April 23 12

beauty of schemas is that they are physically

different files

Monday April 23 12

(no IO hit when deleting no lsquoswiss cheesersquo)

Monday April 23 12

downside requires ~30 seconds of maintenance to roll out new schema

mapping

Monday April 23 12

(could be solved by having concept of ldquoread-

onlyrdquo mode for some DBs)

Monday April 23 12

not great for range-scans that would span across

shards

Monday April 23 12

latest project follow graph

Monday April 23 12

v1 simple DB table(source_id target_id

status)

Monday April 23 12

who do I followwho follows me

do I follow Xdoes X follow me

Monday April 23 12

DB was busy so we started storing parallel

version in Redis

Monday April 23 12

follow_all(300 item list)

Monday April 23 12

inconsistency

Monday April 23 12

extra logic

Monday April 23 12

so much extra logic

Monday April 23 12

exposing your support team to the idea of cache invalidation

Monday April 23 12

Monday April 23 12

redesign took a page from twitterrsquos book

Monday April 23 12

PG can handle tens of thousands of requests very light memcached

caching

Monday April 23 12

next steps

Monday April 23 12

isolating services to minimize open conns

Monday April 23 12

investigate physical hardware etc to reduce

need to re-shard

Monday April 23 12

Wrap up

Monday April 23 12

you donrsquot need to give up PGrsquos durability amp features to shard

Monday April 23 12

continue to let the DB do what the DB is great at

Monday April 23 12

ldquodonrsquot shard until you have tordquo

Monday April 23 12

(but donrsquot over-estimate how hard it will be either)

Monday April 23 12

scaled within constraints of the cloud

Monday April 23 12

PG success story

Monday April 23 12

(wersquore really excited about 92)

Monday April 23 12

thanks any qs

Monday April 23 12

ldquosurely wersquoll have hired someone experienced before we actually need

to shardrdquo

Monday April 23 12

never true about scaling

Monday April 23 12

1 choosing a method2 adapting the application

3 expanding capacity

Monday April 23 12

evaluated solutions

Monday April 23 12

at the time none were up to task of being our

primary DB

Monday April 23 12

NoSQL alternatives

Monday April 23 12

Skypersquos sharding proxy

Monday April 23 12

Rangedate-based partitioning

Monday April 23 12

did in Postgres itself

Monday April 23 12

requirements

Monday April 23 12

1 low operational amp code complexity

Monday April 23 12

2 easy expanding of capacity

Monday April 23 12

3 low performance impact on application

Monday April 23 12

schema-based logical sharding

Monday April 23 12

many many many (thousands) of logical

shards

Monday April 23 12

that map to fewer physical ones

Monday April 23 12

8 logical shards on 2 machines

user_id 8 = logical shard

logical shards -gt physical shard map

0 A 1 A 2 A 3 A 4 B 5 B 6 B 7 B

Monday April 23 12

8 logical shards on 2 4 machines

user_id 8 = logical shard

logical shards -gt physical shard map

0 A 1 A 2 C 3 C 4 B 5 B 6 D 7 D

Monday April 23 12

iexclschemas

Monday April 23 12

all that lsquopublicrsquo stuff Irsquod been glossing over for 2

years

Monday April 23 12

- database - schema - table - columns

Monday April 23 12

spun up set of machines

Monday April 23 12

using fabric created thousands of schemas

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineB shard4 photos_by_user shard5 photos_by_user shard6 photos_by_user shard7 photos_by_user

Monday April 23 12

(fabric or similar parallel task executor is

essential)

Monday April 23 12

application-side logic

Monday April 23 12

SHARD_TO_DB =

SHARD_TO_DB[0] = 0SHARD_TO_DB[1] = 0SHARD_TO_DB[2] = 0SHARD_TO_DB[3] = 0SHARD_TO_DB[4] = 1SHARD_TO_DB[5] = 1SHARD_TO_DB[6] = 1SHARD_TO_DB[7] = 1

Monday April 23 12

instead of Django ORM wrote really simple db

abstraction layer

Monday April 23 12

selectupdateinsertdelete

Monday April 23 12

select(fields table_name shard_key where_statement where_parameters)

Monday April 23 12

select(fields table_name shard_key where_statement where_parameters)

shard_key num_logical_shards = shard_id

Monday April 23 12

in most cases user_id for us

Monday April 23 12

custom Django test runner to createtear-down sharded DBs

Monday April 23 12

most queries involve visiting handful of shards

over one or two machines

Monday April 23 12

if mapping across shards on single DB UNION

ALL to aggregate

Monday April 23 12

clients to library pass in((shard_key id)

(shard_keyid)) etc

Monday April 23 12

library maps sub-selects to each shard and each

machine

Monday April 23 12

parallel execution (per-machine at least)

Monday April 23 12

-gt Append (cost=00097372 rows=100 width=12) (actual time=0290160035 rows=30 loops=1) -gt Limit (cost=00080624 rows=30 width=12) (actual time=0288159913 rows=14 loops=1) -gt Index Scan Backward using index on table (cost=0001865104 rows=694 width=12) (actual time=0286159885 rows=14 loops=1) -gt Limit (cost=0007115 rows=30 width=12) (actual time=00150018 rows=1 loops=1) -gt Index Scan using index on table (cost=00010199 rows=43 width=12) (actual time=00130014 rows=1 loops=1) (etc)

Monday April 23 12

eventually would be nice to parallelize across

machines

Monday April 23 12

next challenge unique IDs

Monday April 23 12

requirements

Monday April 23 12

1 should be time sortable without requiring

a lookup

Monday April 23 12

2 should be 64-bit

Monday April 23 12

3 low operational complexity

Monday April 23 12

surveyed the options

Monday April 23 12

ticket servers

Monday April 23 12

UUID

Monday April 23 12

twitter snowflake

Monday April 23 12

application-level IDs ala Mongo

Monday April 23 12

hey the db is already pretty good about

incrementing sequences

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

CREATE OR REPLACE FUNCTION insta5next_id(OUT result bigint) AS $$DECLARE our_epoch bigint = 1314220021721 seq_id bigint now_millis bigint shard_id int = 5BEGIN SELECT nextval(insta5table_id_seq) 1024 INTO seq_id

SELECT FLOOR(EXTRACT(EPOCH FROM clock_timestamp()) 1000) INTO now_millis result = (now_millis - our_epoch) ltlt 23 result = result | (shard_id ltlt 10) result = result | (seq_id)END$$ LANGUAGE PLPGSQL

Monday April 23 12

pulling shard ID from ID

shard_id = id ^ ((id gtgt 23) ltlt 23)timestamp = EPOCH + id gtgt 23

Monday April 23 12

pros guaranteed unique in 64-bits not much of a

CPU overhead

Monday April 23 12

cons large IDs from the get-go

Monday April 23 12

hundreds of millions of IDs generated with this

scheme no issues

Monday April 23 12

well what about ldquore-shardingrdquo

Monday April 23 12

first recourse pg_reorg

Monday April 23 12

rewrites tables in index order

Monday April 23 12

only requires brief locks for atomic table renames

Monday April 23 12

20+GB savings on some of our dbs

Monday April 23 12

Monday April 23 12

especially useful on EC2

Monday April 23 12

but sometimes you just have to reshard

Monday April 23 12

streaming replication to the rescue

Monday April 23 12

(btw repmgr is awesome)

Monday April 23 12

repmgr standby clone ltmastergt

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineArsquo shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineC shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

Monday April 23 12

PGBouncer abstracts moving DBs from the

app logic

Monday April 23 12

can do this as long as you have more logical shards than physical

ones

Monday April 23 12

beauty of schemas is that they are physically

different files

Monday April 23 12

(no IO hit when deleting no lsquoswiss cheesersquo)

Monday April 23 12

downside requires ~30 seconds of maintenance to roll out new schema

mapping

Monday April 23 12

(could be solved by having concept of ldquoread-

onlyrdquo mode for some DBs)

Monday April 23 12

not great for range-scans that would span across

shards

Monday April 23 12

latest project follow graph

Monday April 23 12

v1 simple DB table(source_id target_id

status)

Monday April 23 12

who do I followwho follows me

do I follow Xdoes X follow me

Monday April 23 12

DB was busy so we started storing parallel

version in Redis

Monday April 23 12

follow_all(300 item list)

Monday April 23 12

inconsistency

Monday April 23 12

extra logic

Monday April 23 12

so much extra logic

Monday April 23 12

exposing your support team to the idea of cache invalidation

Monday April 23 12

Monday April 23 12

redesign took a page from twitterrsquos book

Monday April 23 12

PG can handle tens of thousands of requests very light memcached

caching

Monday April 23 12

next steps

Monday April 23 12

isolating services to minimize open conns

Monday April 23 12

investigate physical hardware etc to reduce

need to re-shard

Monday April 23 12

Wrap up

Monday April 23 12

you donrsquot need to give up PGrsquos durability amp features to shard

Monday April 23 12

continue to let the DB do what the DB is great at

Monday April 23 12

ldquodonrsquot shard until you have tordquo

Monday April 23 12

(but donrsquot over-estimate how hard it will be either)

Monday April 23 12

scaled within constraints of the cloud

Monday April 23 12

PG success story

Monday April 23 12

(wersquore really excited about 92)

Monday April 23 12

thanks any qs

Monday April 23 12

never true about scaling

Monday April 23 12

1 choosing a method2 adapting the application

3 expanding capacity

Monday April 23 12

evaluated solutions

Monday April 23 12

at the time none were up to task of being our

primary DB

Monday April 23 12

NoSQL alternatives

Monday April 23 12

Skypersquos sharding proxy

Monday April 23 12

Rangedate-based partitioning

Monday April 23 12

did in Postgres itself

Monday April 23 12

requirements

Monday April 23 12

1 low operational amp code complexity

Monday April 23 12

2 easy expanding of capacity

Monday April 23 12

3 low performance impact on application

Monday April 23 12

schema-based logical sharding

Monday April 23 12

many many many (thousands) of logical

shards

Monday April 23 12

that map to fewer physical ones

Monday April 23 12

8 logical shards on 2 machines

user_id 8 = logical shard

logical shards -gt physical shard map

0 A 1 A 2 A 3 A 4 B 5 B 6 B 7 B

Monday April 23 12

8 logical shards on 2 4 machines

user_id 8 = logical shard

logical shards -gt physical shard map

0 A 1 A 2 C 3 C 4 B 5 B 6 D 7 D

Monday April 23 12

iexclschemas

Monday April 23 12

all that lsquopublicrsquo stuff Irsquod been glossing over for 2

years

Monday April 23 12

- database - schema - table - columns

Monday April 23 12

spun up set of machines

Monday April 23 12

using fabric created thousands of schemas

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineB shard4 photos_by_user shard5 photos_by_user shard6 photos_by_user shard7 photos_by_user

Monday April 23 12

(fabric or similar parallel task executor is

essential)

Monday April 23 12

application-side logic

Monday April 23 12

SHARD_TO_DB =

SHARD_TO_DB[0] = 0SHARD_TO_DB[1] = 0SHARD_TO_DB[2] = 0SHARD_TO_DB[3] = 0SHARD_TO_DB[4] = 1SHARD_TO_DB[5] = 1SHARD_TO_DB[6] = 1SHARD_TO_DB[7] = 1

Monday April 23 12

instead of Django ORM wrote really simple db

abstraction layer

Monday April 23 12

selectupdateinsertdelete

Monday April 23 12

select(fields table_name shard_key where_statement where_parameters)

Monday April 23 12

select(fields table_name shard_key where_statement where_parameters)

shard_key num_logical_shards = shard_id

Monday April 23 12

in most cases user_id for us

Monday April 23 12

custom Django test runner to createtear-down sharded DBs

Monday April 23 12

most queries involve visiting handful of shards

over one or two machines

Monday April 23 12

if mapping across shards on single DB UNION

ALL to aggregate

Monday April 23 12

clients to library pass in((shard_key id)

(shard_keyid)) etc

Monday April 23 12

library maps sub-selects to each shard and each

machine

Monday April 23 12

parallel execution (per-machine at least)

Monday April 23 12

-gt Append (cost=00097372 rows=100 width=12) (actual time=0290160035 rows=30 loops=1) -gt Limit (cost=00080624 rows=30 width=12) (actual time=0288159913 rows=14 loops=1) -gt Index Scan Backward using index on table (cost=0001865104 rows=694 width=12) (actual time=0286159885 rows=14 loops=1) -gt Limit (cost=0007115 rows=30 width=12) (actual time=00150018 rows=1 loops=1) -gt Index Scan using index on table (cost=00010199 rows=43 width=12) (actual time=00130014 rows=1 loops=1) (etc)

Monday April 23 12

eventually would be nice to parallelize across

machines

Monday April 23 12

next challenge unique IDs

Monday April 23 12

requirements

Monday April 23 12

1 should be time sortable without requiring

a lookup

Monday April 23 12

2 should be 64-bit

Monday April 23 12

3 low operational complexity

Monday April 23 12

surveyed the options

Monday April 23 12

ticket servers

Monday April 23 12

UUID

Monday April 23 12

twitter snowflake

Monday April 23 12

application-level IDs ala Mongo

Monday April 23 12

hey the db is already pretty good about

incrementing sequences

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

CREATE OR REPLACE FUNCTION insta5next_id(OUT result bigint) AS $$DECLARE our_epoch bigint = 1314220021721 seq_id bigint now_millis bigint shard_id int = 5BEGIN SELECT nextval(insta5table_id_seq) 1024 INTO seq_id

SELECT FLOOR(EXTRACT(EPOCH FROM clock_timestamp()) 1000) INTO now_millis result = (now_millis - our_epoch) ltlt 23 result = result | (shard_id ltlt 10) result = result | (seq_id)END$$ LANGUAGE PLPGSQL

Monday April 23 12

pulling shard ID from ID

shard_id = id ^ ((id gtgt 23) ltlt 23)timestamp = EPOCH + id gtgt 23

Monday April 23 12

pros guaranteed unique in 64-bits not much of a

CPU overhead

Monday April 23 12

cons large IDs from the get-go

Monday April 23 12

hundreds of millions of IDs generated with this

scheme no issues

Monday April 23 12

well what about ldquore-shardingrdquo

Monday April 23 12

first recourse pg_reorg

Monday April 23 12

rewrites tables in index order

Monday April 23 12

only requires brief locks for atomic table renames

Monday April 23 12

20+GB savings on some of our dbs

Monday April 23 12

Monday April 23 12

especially useful on EC2

Monday April 23 12

but sometimes you just have to reshard

Monday April 23 12

streaming replication to the rescue

Monday April 23 12

(btw repmgr is awesome)

Monday April 23 12

repmgr standby clone ltmastergt

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineArsquo shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineC shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

Monday April 23 12

PGBouncer abstracts moving DBs from the

app logic

Monday April 23 12

can do this as long as you have more logical shards than physical

ones

Monday April 23 12

beauty of schemas is that they are physically

different files

Monday April 23 12

(no IO hit when deleting no lsquoswiss cheesersquo)

Monday April 23 12

downside requires ~30 seconds of maintenance to roll out new schema

mapping

Monday April 23 12

(could be solved by having concept of ldquoread-

onlyrdquo mode for some DBs)

Monday April 23 12

not great for range-scans that would span across

shards

Monday April 23 12

latest project follow graph

Monday April 23 12

v1 simple DB table(source_id target_id

status)

Monday April 23 12

who do I followwho follows me

do I follow Xdoes X follow me

Monday April 23 12

DB was busy so we started storing parallel

version in Redis

Monday April 23 12

follow_all(300 item list)

Monday April 23 12

inconsistency

Monday April 23 12

extra logic

Monday April 23 12

so much extra logic

Monday April 23 12

exposing your support team to the idea of cache invalidation

Monday April 23 12

Monday April 23 12

redesign took a page from twitterrsquos book

Monday April 23 12

PG can handle tens of thousands of requests very light memcached

caching

Monday April 23 12

next steps

Monday April 23 12

isolating services to minimize open conns

Monday April 23 12

investigate physical hardware etc to reduce

need to re-shard

Monday April 23 12

Wrap up

Monday April 23 12

you donrsquot need to give up PGrsquos durability amp features to shard

Monday April 23 12

continue to let the DB do what the DB is great at

Monday April 23 12

ldquodonrsquot shard until you have tordquo

Monday April 23 12

(but donrsquot over-estimate how hard it will be either)

Monday April 23 12

scaled within constraints of the cloud

Monday April 23 12

PG success story

Monday April 23 12

(wersquore really excited about 92)

Monday April 23 12

thanks any qs

Monday April 23 12

1 choosing a method2 adapting the application

3 expanding capacity

Monday April 23 12

evaluated solutions

Monday April 23 12

at the time none were up to task of being our

primary DB

Monday April 23 12

NoSQL alternatives

Monday April 23 12

Skypersquos sharding proxy

Monday April 23 12

Rangedate-based partitioning

Monday April 23 12

did in Postgres itself

Monday April 23 12

requirements

Monday April 23 12

1 low operational amp code complexity

Monday April 23 12

2 easy expanding of capacity

Monday April 23 12

3 low performance impact on application

Monday April 23 12

schema-based logical sharding

Monday April 23 12

many many many (thousands) of logical

shards

Monday April 23 12

that map to fewer physical ones

Monday April 23 12

8 logical shards on 2 machines

user_id 8 = logical shard

logical shards -gt physical shard map

0 A 1 A 2 A 3 A 4 B 5 B 6 B 7 B

Monday April 23 12

8 logical shards on 2 4 machines

user_id 8 = logical shard

logical shards -gt physical shard map

0 A 1 A 2 C 3 C 4 B 5 B 6 D 7 D

Monday April 23 12

iexclschemas

Monday April 23 12

all that lsquopublicrsquo stuff Irsquod been glossing over for 2

years

Monday April 23 12

- database - schema - table - columns

Monday April 23 12

spun up set of machines

Monday April 23 12

using fabric created thousands of schemas

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineB shard4 photos_by_user shard5 photos_by_user shard6 photos_by_user shard7 photos_by_user

Monday April 23 12

(fabric or similar parallel task executor is

essential)

Monday April 23 12

application-side logic

Monday April 23 12

SHARD_TO_DB =

SHARD_TO_DB[0] = 0SHARD_TO_DB[1] = 0SHARD_TO_DB[2] = 0SHARD_TO_DB[3] = 0SHARD_TO_DB[4] = 1SHARD_TO_DB[5] = 1SHARD_TO_DB[6] = 1SHARD_TO_DB[7] = 1

Monday April 23 12

instead of Django ORM wrote really simple db

abstraction layer

Monday April 23 12

selectupdateinsertdelete

Monday April 23 12

select(fields table_name shard_key where_statement where_parameters)

Monday April 23 12

select(fields table_name shard_key where_statement where_parameters)

shard_key num_logical_shards = shard_id

Monday April 23 12

in most cases user_id for us

Monday April 23 12

custom Django test runner to createtear-down sharded DBs

Monday April 23 12

most queries involve visiting handful of shards

over one or two machines

Monday April 23 12

if mapping across shards on single DB UNION

ALL to aggregate

Monday April 23 12

clients to library pass in((shard_key id)

(shard_keyid)) etc

Monday April 23 12

library maps sub-selects to each shard and each

machine

Monday April 23 12

parallel execution (per-machine at least)

Monday April 23 12

-gt Append (cost=00097372 rows=100 width=12) (actual time=0290160035 rows=30 loops=1) -gt Limit (cost=00080624 rows=30 width=12) (actual time=0288159913 rows=14 loops=1) -gt Index Scan Backward using index on table (cost=0001865104 rows=694 width=12) (actual time=0286159885 rows=14 loops=1) -gt Limit (cost=0007115 rows=30 width=12) (actual time=00150018 rows=1 loops=1) -gt Index Scan using index on table (cost=00010199 rows=43 width=12) (actual time=00130014 rows=1 loops=1) (etc)

Monday April 23 12

eventually would be nice to parallelize across

machines

Monday April 23 12

next challenge unique IDs

Monday April 23 12

requirements

Monday April 23 12

1 should be time sortable without requiring

a lookup

Monday April 23 12

2 should be 64-bit

Monday April 23 12

3 low operational complexity

Monday April 23 12

surveyed the options

Monday April 23 12

ticket servers

Monday April 23 12

UUID

Monday April 23 12

twitter snowflake

Monday April 23 12

application-level IDs ala Mongo

Monday April 23 12

hey the db is already pretty good about

incrementing sequences

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

CREATE OR REPLACE FUNCTION insta5next_id(OUT result bigint) AS $$DECLARE our_epoch bigint = 1314220021721 seq_id bigint now_millis bigint shard_id int = 5BEGIN SELECT nextval(insta5table_id_seq) 1024 INTO seq_id

SELECT FLOOR(EXTRACT(EPOCH FROM clock_timestamp()) 1000) INTO now_millis result = (now_millis - our_epoch) ltlt 23 result = result | (shard_id ltlt 10) result = result | (seq_id)END$$ LANGUAGE PLPGSQL

Monday April 23 12

pulling shard ID from ID

shard_id = id ^ ((id gtgt 23) ltlt 23)timestamp = EPOCH + id gtgt 23

Monday April 23 12

pros guaranteed unique in 64-bits not much of a

CPU overhead

Monday April 23 12

cons large IDs from the get-go

Monday April 23 12

hundreds of millions of IDs generated with this

scheme no issues

Monday April 23 12

well what about ldquore-shardingrdquo

Monday April 23 12

first recourse pg_reorg

Monday April 23 12

rewrites tables in index order

Monday April 23 12

only requires brief locks for atomic table renames

Monday April 23 12

20+GB savings on some of our dbs

Monday April 23 12

Monday April 23 12

especially useful on EC2

Monday April 23 12

but sometimes you just have to reshard

Monday April 23 12

streaming replication to the rescue

Monday April 23 12

(btw repmgr is awesome)

Monday April 23 12

repmgr standby clone ltmastergt

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineArsquo shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineC shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

Monday April 23 12

PGBouncer abstracts moving DBs from the

app logic

Monday April 23 12

can do this as long as you have more logical shards than physical

ones

Monday April 23 12

beauty of schemas is that they are physically

different files

Monday April 23 12

(no IO hit when deleting no lsquoswiss cheesersquo)

Monday April 23 12

downside requires ~30 seconds of maintenance to roll out new schema

mapping

Monday April 23 12

(could be solved by having concept of ldquoread-

onlyrdquo mode for some DBs)

Monday April 23 12

not great for range-scans that would span across

shards

Monday April 23 12

latest project follow graph

Monday April 23 12

v1 simple DB table(source_id target_id

status)

Monday April 23 12

who do I followwho follows me

do I follow Xdoes X follow me

Monday April 23 12

DB was busy so we started storing parallel

version in Redis

Monday April 23 12

follow_all(300 item list)

Monday April 23 12

inconsistency

Monday April 23 12

extra logic

Monday April 23 12

so much extra logic

Monday April 23 12

exposing your support team to the idea of cache invalidation

Monday April 23 12

Monday April 23 12

redesign took a page from twitterrsquos book

Monday April 23 12

PG can handle tens of thousands of requests very light memcached

caching

Monday April 23 12

next steps

Monday April 23 12

isolating services to minimize open conns

Monday April 23 12

investigate physical hardware etc to reduce

need to re-shard

Monday April 23 12

Wrap up

Monday April 23 12

you donrsquot need to give up PGrsquos durability amp features to shard

Monday April 23 12

continue to let the DB do what the DB is great at

Monday April 23 12

ldquodonrsquot shard until you have tordquo

Monday April 23 12

(but donrsquot over-estimate how hard it will be either)

Monday April 23 12

scaled within constraints of the cloud

Monday April 23 12

PG success story

Monday April 23 12

(wersquore really excited about 92)

Monday April 23 12

thanks any qs

Monday April 23 12


at the time none were up to task of being our

primary DB

Monday April 23 12

NoSQL alternatives

Monday April 23 12

Skypersquos sharding proxy

Monday April 23 12

Rangedate-based partitioning

Monday April 23 12

did in Postgres itself

Monday April 23 12

requirements

Monday April 23 12

1 low operational amp code complexity

Monday April 23 12

2 easy expanding of capacity

Monday April 23 12

3 low performance impact on application

Monday April 23 12

schema-based logical sharding

Monday April 23 12

many many many (thousands) of logical

shards

Monday April 23 12

that map to fewer physical ones

Monday April 23 12

8 logical shards on 2 machines

user_id 8 = logical shard

logical shards -gt physical shard map

0 A 1 A 2 A 3 A 4 B 5 B 6 B 7 B

Monday April 23 12

8 logical shards on 2 4 machines

user_id 8 = logical shard

logical shards -gt physical shard map

0 A 1 A 2 C 3 C 4 B 5 B 6 D 7 D

Monday April 23 12

iexclschemas

Monday April 23 12

all that lsquopublicrsquo stuff Irsquod been glossing over for 2

years

Monday April 23 12

- database - schema - table - columns

Monday April 23 12

spun up set of machines

Monday April 23 12

using fabric created thousands of schemas

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineB shard4 photos_by_user shard5 photos_by_user shard6 photos_by_user shard7 photos_by_user

Monday April 23 12

(fabric or similar parallel task executor is

essential)

Monday April 23 12

application-side logic

Monday April 23 12

SHARD_TO_DB =

SHARD_TO_DB[0] = 0SHARD_TO_DB[1] = 0SHARD_TO_DB[2] = 0SHARD_TO_DB[3] = 0SHARD_TO_DB[4] = 1SHARD_TO_DB[5] = 1SHARD_TO_DB[6] = 1SHARD_TO_DB[7] = 1

Monday April 23 12

instead of Django ORM wrote really simple db

abstraction layer

Monday April 23 12

selectupdateinsertdelete

Monday April 23 12

select(fields table_name shard_key where_statement where_parameters)

Monday April 23 12

select(fields table_name shard_key where_statement where_parameters)

shard_key num_logical_shards = shard_id

Monday April 23 12

in most cases user_id for us

Monday April 23 12

custom Django test runner to createtear-down sharded DBs

Monday April 23 12

most queries involve visiting handful of shards

over one or two machines

Monday April 23 12

if mapping across shards on single DB UNION

ALL to aggregate

Monday April 23 12

clients to library pass in((shard_key id)

(shard_keyid)) etc

Monday April 23 12

library maps sub-selects to each shard and each

machine

Monday April 23 12

parallel execution (per-machine at least)

Monday April 23 12

-gt Append (cost=00097372 rows=100 width=12) (actual time=0290160035 rows=30 loops=1) -gt Limit (cost=00080624 rows=30 width=12) (actual time=0288159913 rows=14 loops=1) -gt Index Scan Backward using index on table (cost=0001865104 rows=694 width=12) (actual time=0286159885 rows=14 loops=1) -gt Limit (cost=0007115 rows=30 width=12) (actual time=00150018 rows=1 loops=1) -gt Index Scan using index on table (cost=00010199 rows=43 width=12) (actual time=00130014 rows=1 loops=1) (etc)

Monday April 23 12

eventually would be nice to parallelize across

machines

Monday April 23 12

next challenge unique IDs

Monday April 23 12

requirements

Monday April 23 12

1 should be time sortable without requiring

a lookup

Monday April 23 12

2 should be 64-bit

Monday April 23 12

3 low operational complexity

Monday April 23 12

surveyed the options

Monday April 23 12

ticket servers

Monday April 23 12

UUID

Monday April 23 12

twitter snowflake

Monday April 23 12

application-level IDs ala Mongo

Monday April 23 12

hey the db is already pretty good about

incrementing sequences

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

CREATE OR REPLACE FUNCTION insta5next_id(OUT result bigint) AS $$DECLARE our_epoch bigint = 1314220021721 seq_id bigint now_millis bigint shard_id int = 5BEGIN SELECT nextval(insta5table_id_seq) 1024 INTO seq_id

SELECT FLOOR(EXTRACT(EPOCH FROM clock_timestamp()) 1000) INTO now_millis result = (now_millis - our_epoch) ltlt 23 result = result | (shard_id ltlt 10) result = result | (seq_id)END$$ LANGUAGE PLPGSQL

Monday April 23 12

pulling shard ID from ID

shard_id = id ^ ((id gtgt 23) ltlt 23)timestamp = EPOCH + id gtgt 23

Monday April 23 12

pros guaranteed unique in 64-bits not much of a

CPU overhead

Monday April 23 12

cons large IDs from the get-go

Monday April 23 12

hundreds of millions of IDs generated with this

scheme no issues

Monday April 23 12

well what about ldquore-shardingrdquo

Monday April 23 12

first recourse pg_reorg

Monday April 23 12

rewrites tables in index order

Monday April 23 12

only requires brief locks for atomic table renames

Monday April 23 12

20+GB savings on some of our dbs

Monday April 23 12

Monday April 23 12

especially useful on EC2

Monday April 23 12

but sometimes you just have to reshard

Monday April 23 12

streaming replication to the rescue

Monday April 23 12

(btw repmgr is awesome)

Monday April 23 12

repmgr standby clone ltmastergt

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineArsquo shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineC shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

Monday April 23 12

PGBouncer abstracts moving DBs from the

app logic

Monday April 23 12

can do this as long as you have more logical shards than physical

ones

Monday April 23 12

beauty of schemas is that they are physically

different files

Monday April 23 12

(no IO hit when deleting no lsquoswiss cheesersquo)

Monday April 23 12

downside requires ~30 seconds of maintenance to roll out new schema

mapping

Monday April 23 12

(could be solved by having concept of ldquoread-

onlyrdquo mode for some DBs)

Monday April 23 12

not great for range-scans that would span across

shards

Monday April 23 12

latest project follow graph

Monday April 23 12

v1 simple DB table(source_id target_id

status)

Monday April 23 12

who do I followwho follows me

do I follow Xdoes X follow me

Monday April 23 12

DB was busy so we started storing parallel

version in Redis

Monday April 23 12

follow_all(300 item list)

Monday April 23 12

inconsistency

Monday April 23 12

extra logic

Monday April 23 12

so much extra logic

Monday April 23 12

exposing your support team to the idea of cache invalidation

Monday April 23 12

Monday April 23 12

redesign took a page from twitterrsquos book

Monday April 23 12

PG can handle tens of thousands of requests very light memcached

caching

Monday April 23 12

next steps

Monday April 23 12

isolating services to minimize open conns

Monday April 23 12

investigate physical hardware etc to reduce

need to re-shard

Monday April 23 12

Wrap up

Monday April 23 12

you donrsquot need to give up PGrsquos durability amp features to shard

Monday April 23 12

continue to let the DB do what the DB is great at

Monday April 23 12

ldquodonrsquot shard until you have tordquo

Monday April 23 12

(but donrsquot over-estimate how hard it will be either)

Monday April 23 12

scaled within constraints of the cloud

Monday April 23 12

PG success story

Monday April 23 12

(wersquore really excited about 92)

Monday April 23 12

thanks any qs

Monday April 23 12

NoSQL alternatives

Monday April 23 12

Skypersquos sharding proxy

Monday April 23 12

Rangedate-based partitioning

Monday April 23 12

did in Postgres itself

Monday April 23 12

requirements

Monday April 23 12

1 low operational amp code complexity

Monday April 23 12

2 easy expanding of capacity

Monday April 23 12

3 low performance impact on application

Monday April 23 12

schema-based logical sharding

Monday April 23 12

many many many (thousands) of logical

shards

Monday April 23 12

that map to fewer physical ones

Monday April 23 12

8 logical shards on 2 machines

user_id 8 = logical shard

logical shards -gt physical shard map

0 A 1 A 2 A 3 A 4 B 5 B 6 B 7 B

Monday April 23 12

8 logical shards on 2 4 machines

user_id 8 = logical shard

logical shards -gt physical shard map

0 A 1 A 2 C 3 C 4 B 5 B 6 D 7 D

Monday April 23 12

iexclschemas

Monday April 23 12

all that lsquopublicrsquo stuff Irsquod been glossing over for 2

years

Monday April 23 12

- database - schema - table - columns

Monday April 23 12

spun up set of machines

Monday April 23 12

using fabric created thousands of schemas

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineB shard4 photos_by_user shard5 photos_by_user shard6 photos_by_user shard7 photos_by_user

Monday April 23 12

(fabric or similar parallel task executor is

essential)

Monday April 23 12

application-side logic

Monday April 23 12

SHARD_TO_DB =

SHARD_TO_DB[0] = 0SHARD_TO_DB[1] = 0SHARD_TO_DB[2] = 0SHARD_TO_DB[3] = 0SHARD_TO_DB[4] = 1SHARD_TO_DB[5] = 1SHARD_TO_DB[6] = 1SHARD_TO_DB[7] = 1

Monday April 23 12

instead of Django ORM wrote really simple db

abstraction layer

Monday April 23 12

selectupdateinsertdelete

Monday April 23 12

select(fields table_name shard_key where_statement where_parameters)

Monday April 23 12

select(fields table_name shard_key where_statement where_parameters)

shard_key num_logical_shards = shard_id

Monday April 23 12

in most cases user_id for us

Monday April 23 12

custom Django test runner to createtear-down sharded DBs

Monday April 23 12

most queries involve visiting handful of shards

over one or two machines

Monday April 23 12

if mapping across shards on single DB UNION

ALL to aggregate

Monday April 23 12

clients to library pass in((shard_key id)

(shard_keyid)) etc

Monday April 23 12

library maps sub-selects to each shard and each

machine

Monday April 23 12

parallel execution (per-machine at least)

Monday April 23 12

-gt Append (cost=00097372 rows=100 width=12) (actual time=0290160035 rows=30 loops=1) -gt Limit (cost=00080624 rows=30 width=12) (actual time=0288159913 rows=14 loops=1) -gt Index Scan Backward using index on table (cost=0001865104 rows=694 width=12) (actual time=0286159885 rows=14 loops=1) -gt Limit (cost=0007115 rows=30 width=12) (actual time=00150018 rows=1 loops=1) -gt Index Scan using index on table (cost=00010199 rows=43 width=12) (actual time=00130014 rows=1 loops=1) (etc)

Monday April 23 12

eventually would be nice to parallelize across

machines

Monday April 23 12

next challenge unique IDs

Monday April 23 12

requirements

Monday April 23 12

1 should be time sortable without requiring

a lookup

Monday April 23 12

2 should be 64-bit

Monday April 23 12

3 low operational complexity

Monday April 23 12

surveyed the options

Monday April 23 12

ticket servers

Monday April 23 12

UUID

Monday April 23 12

twitter snowflake

Monday April 23 12

application-level IDs ala Mongo

Monday April 23 12

hey the db is already pretty good about

incrementing sequences

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

CREATE OR REPLACE FUNCTION insta5next_id(OUT result bigint) AS $$DECLARE our_epoch bigint = 1314220021721 seq_id bigint now_millis bigint shard_id int = 5BEGIN SELECT nextval(insta5table_id_seq) 1024 INTO seq_id

SELECT FLOOR(EXTRACT(EPOCH FROM clock_timestamp()) 1000) INTO now_millis result = (now_millis - our_epoch) ltlt 23 result = result | (shard_id ltlt 10) result = result | (seq_id)END$$ LANGUAGE PLPGSQL

Monday April 23 12

pulling shard ID from ID

shard_id = id ^ ((id gtgt 23) ltlt 23)timestamp = EPOCH + id gtgt 23

Monday April 23 12

pros guaranteed unique in 64-bits not much of a

CPU overhead

Monday April 23 12

cons large IDs from the get-go

Monday April 23 12

hundreds of millions of IDs generated with this

scheme no issues

Monday April 23 12

well what about ldquore-shardingrdquo

Monday April 23 12

first recourse pg_reorg

Monday April 23 12

rewrites tables in index order

Monday April 23 12

only requires brief locks for atomic table renames

Monday April 23 12

20+GB savings on some of our dbs

Monday April 23 12

Monday April 23 12

especially useful on EC2

Monday April 23 12

but sometimes you just have to reshard

Monday April 23 12

streaming replication to the rescue

Monday April 23 12

(btw repmgr is awesome)

Monday April 23 12

repmgr standby clone ltmastergt

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineArsquo shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineC shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

Monday April 23 12

PGBouncer abstracts moving DBs from the

app logic

Monday April 23 12

can do this as long as you have more logical shards than physical

ones

Monday April 23 12

beauty of schemas is that they are physically

different files

Monday April 23 12

(no IO hit when deleting no lsquoswiss cheesersquo)

Monday April 23 12

downside requires ~30 seconds of maintenance to roll out new schema

mapping

Monday April 23 12

(could be solved by having concept of ldquoread-

onlyrdquo mode for some DBs)

Monday April 23 12

not great for range-scans that would span across

shards

Monday April 23 12

latest project follow graph

Monday April 23 12

v1 simple DB table(source_id target_id

status)

Monday April 23 12

who do I followwho follows me

do I follow Xdoes X follow me

Monday April 23 12

DB was busy so we started storing parallel

version in Redis

Monday April 23 12

follow_all(300 item list)

Monday April 23 12

inconsistency

Monday April 23 12

extra logic

Monday April 23 12

so much extra logic

Monday April 23 12

exposing your support team to the idea of cache invalidation

Monday April 23 12

Monday April 23 12

redesign took a page from twitterrsquos book

Monday April 23 12

PG can handle tens of thousands of requests very light memcached

caching

Monday April 23 12

next steps

Monday April 23 12

isolating services to minimize open conns

Monday April 23 12

investigate physical hardware etc to reduce

need to re-shard

Monday April 23 12

Wrap up

Monday April 23 12

you donrsquot need to give up PGrsquos durability amp features to shard

Monday April 23 12

continue to let the DB do what the DB is great at

Monday April 23 12

ldquodonrsquot shard until you have tordquo

Monday April 23 12

(but donrsquot over-estimate how hard it will be either)

Monday April 23 12

scaled within constraints of the cloud

Monday April 23 12

PG success story

Monday April 23 12

(wersquore really excited about 92)

Monday April 23 12

thanks any qs

Monday April 23 12

Skypersquos sharding proxy

Monday April 23 12

Rangedate-based partitioning

Monday April 23 12

did in Postgres itself

Monday April 23 12

requirements

Monday April 23 12

1 low operational amp code complexity

Monday April 23 12

2 easy expanding of capacity

Monday April 23 12

3 low performance impact on application

Monday April 23 12

schema-based logical sharding

Monday April 23 12

many many many (thousands) of logical

shards

Monday April 23 12

that map to fewer physical ones

Monday April 23 12

8 logical shards on 2 machines

user_id 8 = logical shard

logical shards -gt physical shard map

0 A 1 A 2 A 3 A 4 B 5 B 6 B 7 B

Monday April 23 12

8 logical shards on 2 4 machines

user_id 8 = logical shard

logical shards -gt physical shard map

0 A 1 A 2 C 3 C 4 B 5 B 6 D 7 D

Monday April 23 12

iexclschemas

Monday April 23 12

all that lsquopublicrsquo stuff Irsquod been glossing over for 2

years

Monday April 23 12

- database - schema - table - columns

Monday April 23 12

spun up set of machines

Monday April 23 12

using fabric created thousands of schemas

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineB shard4 photos_by_user shard5 photos_by_user shard6 photos_by_user shard7 photos_by_user

Monday April 23 12

(fabric or similar parallel task executor is

essential)

Monday April 23 12

application-side logic

Monday April 23 12

SHARD_TO_DB =

SHARD_TO_DB[0] = 0SHARD_TO_DB[1] = 0SHARD_TO_DB[2] = 0SHARD_TO_DB[3] = 0SHARD_TO_DB[4] = 1SHARD_TO_DB[5] = 1SHARD_TO_DB[6] = 1SHARD_TO_DB[7] = 1

Monday April 23 12

instead of Django ORM wrote really simple db

abstraction layer

Monday April 23 12

selectupdateinsertdelete

Monday April 23 12

select(fields table_name shard_key where_statement where_parameters)

Monday April 23 12

select(fields table_name shard_key where_statement where_parameters)

shard_key num_logical_shards = shard_id

Monday April 23 12

in most cases user_id for us

Monday April 23 12

custom Django test runner to createtear-down sharded DBs

Monday April 23 12

most queries involve visiting handful of shards

over one or two machines

Monday April 23 12

if mapping across shards on single DB UNION

ALL to aggregate

Monday April 23 12

clients to library pass in((shard_key id)

(shard_keyid)) etc

Monday April 23 12

library maps sub-selects to each shard and each

machine

Monday April 23 12

parallel execution (per-machine at least)

Monday April 23 12

-gt Append (cost=00097372 rows=100 width=12) (actual time=0290160035 rows=30 loops=1) -gt Limit (cost=00080624 rows=30 width=12) (actual time=0288159913 rows=14 loops=1) -gt Index Scan Backward using index on table (cost=0001865104 rows=694 width=12) (actual time=0286159885 rows=14 loops=1) -gt Limit (cost=0007115 rows=30 width=12) (actual time=00150018 rows=1 loops=1) -gt Index Scan using index on table (cost=00010199 rows=43 width=12) (actual time=00130014 rows=1 loops=1) (etc)

Monday April 23 12

eventually would be nice to parallelize across

machines

Monday April 23 12

next challenge unique IDs

Monday April 23 12

requirements

Monday April 23 12

1 should be time sortable without requiring

a lookup

Monday April 23 12

2 should be 64-bit

Monday April 23 12

3 low operational complexity

Monday April 23 12

surveyed the options

Monday April 23 12

ticket servers

Monday April 23 12

UUID

Monday April 23 12

twitter snowflake

Monday April 23 12

application-level IDs ala Mongo

Monday April 23 12

hey the db is already pretty good about

incrementing sequences

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

CREATE OR REPLACE FUNCTION insta5next_id(OUT result bigint) AS $$DECLARE our_epoch bigint = 1314220021721 seq_id bigint now_millis bigint shard_id int = 5BEGIN SELECT nextval(insta5table_id_seq) 1024 INTO seq_id

SELECT FLOOR(EXTRACT(EPOCH FROM clock_timestamp()) 1000) INTO now_millis result = (now_millis - our_epoch) ltlt 23 result = result | (shard_id ltlt 10) result = result | (seq_id)END$$ LANGUAGE PLPGSQL

Monday April 23 12

pulling shard ID from ID

shard_id = id ^ ((id gtgt 23) ltlt 23)timestamp = EPOCH + id gtgt 23

Monday April 23 12

pros guaranteed unique in 64-bits not much of a

CPU overhead

Monday April 23 12

cons large IDs from the get-go

Monday April 23 12

hundreds of millions of IDs generated with this

scheme no issues

Monday April 23 12

well what about ldquore-shardingrdquo

Monday April 23 12

first recourse pg_reorg

Monday April 23 12

rewrites tables in index order

Monday April 23 12

only requires brief locks for atomic table renames

Monday April 23 12

20+GB savings on some of our dbs

Monday April 23 12

Monday April 23 12

especially useful on EC2

Monday April 23 12

but sometimes you just have to reshard

Monday April 23 12

streaming replication to the rescue

Monday April 23 12

(btw repmgr is awesome)

Monday April 23 12

repmgr standby clone ltmastergt

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineArsquo shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineC shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

Monday April 23 12

PGBouncer abstracts moving DBs from the

app logic

Monday April 23 12

can do this as long as you have more logical shards than physical

ones

Monday April 23 12

beauty of schemas is that they are physically

different files

Monday April 23 12

(no IO hit when deleting no lsquoswiss cheesersquo)

Monday April 23 12

downside requires ~30 seconds of maintenance to roll out new schema

mapping

Monday April 23 12

(could be solved by having concept of ldquoread-

onlyrdquo mode for some DBs)

Monday April 23 12

not great for range-scans that would span across

shards

Monday April 23 12

latest project follow graph

Monday April 23 12

v1 simple DB table(source_id target_id

status)

Monday April 23 12

who do I followwho follows me

do I follow Xdoes X follow me

Monday April 23 12

DB was busy so we started storing parallel

version in Redis

Monday April 23 12

follow_all(300 item list)

Monday April 23 12

inconsistency

Monday April 23 12

extra logic

Monday April 23 12

so much extra logic

Monday April 23 12

exposing your support team to the idea of cache invalidation

Monday April 23 12

Monday April 23 12

redesign took a page from twitterrsquos book

Monday April 23 12

PG can handle tens of thousands of requests very light memcached

caching

Monday April 23 12

next steps

Monday April 23 12

isolating services to minimize open conns

Monday April 23 12

investigate physical hardware etc to reduce

need to re-shard

Monday April 23 12

Wrap up

Monday April 23 12

you donrsquot need to give up PGrsquos durability amp features to shard

Monday April 23 12

continue to let the DB do what the DB is great at

Monday April 23 12

ldquodonrsquot shard until you have tordquo

Monday April 23 12

(but donrsquot over-estimate how hard it will be either)

Monday April 23 12

scaled within constraints of the cloud

Monday April 23 12

PG success story

Monday April 23 12

(wersquore really excited about 92)

Monday April 23 12

thanks any qs

Monday April 23 12

Rangedate-based partitioning

Monday April 23 12

did in Postgres itself

Monday April 23 12

requirements

Monday April 23 12

1 low operational amp code complexity

Monday April 23 12

2 easy expanding of capacity

Monday April 23 12

3 low performance impact on application

Monday April 23 12

schema-based logical sharding

Monday April 23 12

many many many (thousands) of logical

shards

Monday April 23 12

that map to fewer physical ones

Monday April 23 12

8 logical shards on 2 machines

user_id 8 = logical shard

logical shards -gt physical shard map

0 A 1 A 2 A 3 A 4 B 5 B 6 B 7 B

Monday April 23 12

8 logical shards on 2 4 machines

user_id 8 = logical shard

logical shards -gt physical shard map

0 A 1 A 2 C 3 C 4 B 5 B 6 D 7 D

Monday April 23 12

iexclschemas

Monday April 23 12

all that lsquopublicrsquo stuff Irsquod been glossing over for 2

years

Monday April 23 12

- database - schema - table - columns

Monday April 23 12

spun up set of machines

Monday April 23 12

using fabric created thousands of schemas

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineB shard4 photos_by_user shard5 photos_by_user shard6 photos_by_user shard7 photos_by_user

Monday April 23 12

(fabric or similar parallel task executor is

essential)

Monday April 23 12

application-side logic

Monday April 23 12

SHARD_TO_DB =

SHARD_TO_DB[0] = 0SHARD_TO_DB[1] = 0SHARD_TO_DB[2] = 0SHARD_TO_DB[3] = 0SHARD_TO_DB[4] = 1SHARD_TO_DB[5] = 1SHARD_TO_DB[6] = 1SHARD_TO_DB[7] = 1

Monday April 23 12

instead of Django ORM wrote really simple db

abstraction layer

Monday April 23 12

selectupdateinsertdelete

Monday April 23 12

select(fields table_name shard_key where_statement where_parameters)

Monday April 23 12

select(fields table_name shard_key where_statement where_parameters)

shard_key num_logical_shards = shard_id

Monday April 23 12

in most cases user_id for us

Monday April 23 12

custom Django test runner to createtear-down sharded DBs

Monday April 23 12

most queries involve visiting handful of shards

over one or two machines

Monday April 23 12

if mapping across shards on single DB UNION

ALL to aggregate

Monday April 23 12

clients to library pass in((shard_key id)

(shard_keyid)) etc

Monday April 23 12

library maps sub-selects to each shard and each

machine

Monday April 23 12

parallel execution (per-machine at least)

Monday April 23 12

-gt Append (cost=00097372 rows=100 width=12) (actual time=0290160035 rows=30 loops=1) -gt Limit (cost=00080624 rows=30 width=12) (actual time=0288159913 rows=14 loops=1) -gt Index Scan Backward using index on table (cost=0001865104 rows=694 width=12) (actual time=0286159885 rows=14 loops=1) -gt Limit (cost=0007115 rows=30 width=12) (actual time=00150018 rows=1 loops=1) -gt Index Scan using index on table (cost=00010199 rows=43 width=12) (actual time=00130014 rows=1 loops=1) (etc)

Monday April 23 12

eventually would be nice to parallelize across

machines

Monday April 23 12

next challenge unique IDs

Monday April 23 12

requirements

Monday April 23 12

1 should be time sortable without requiring

a lookup

Monday April 23 12

2 should be 64-bit

Monday April 23 12

3 low operational complexity

Monday April 23 12

surveyed the options

Monday April 23 12

ticket servers

Monday April 23 12

UUID

Monday April 23 12

twitter snowflake

Monday April 23 12

application-level IDs ala Mongo

Monday April 23 12

hey the db is already pretty good about

incrementing sequences

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

CREATE OR REPLACE FUNCTION insta5next_id(OUT result bigint) AS $$DECLARE our_epoch bigint = 1314220021721 seq_id bigint now_millis bigint shard_id int = 5BEGIN SELECT nextval(insta5table_id_seq) 1024 INTO seq_id

SELECT FLOOR(EXTRACT(EPOCH FROM clock_timestamp()) 1000) INTO now_millis result = (now_millis - our_epoch) ltlt 23 result = result | (shard_id ltlt 10) result = result | (seq_id)END$$ LANGUAGE PLPGSQL

Monday April 23 12

pulling shard ID from ID

shard_id = id ^ ((id gtgt 23) ltlt 23)timestamp = EPOCH + id gtgt 23

Monday April 23 12

pros guaranteed unique in 64-bits not much of a

CPU overhead

Monday April 23 12

cons large IDs from the get-go

Monday April 23 12

hundreds of millions of IDs generated with this

scheme no issues

Monday April 23 12

well what about ldquore-shardingrdquo

Monday April 23 12

first recourse pg_reorg

Monday April 23 12

rewrites tables in index order

Monday April 23 12

only requires brief locks for atomic table renames

Monday April 23 12

20+GB savings on some of our dbs

Monday April 23 12

Monday April 23 12

especially useful on EC2

Monday April 23 12

but sometimes you just have to reshard

Monday April 23 12

streaming replication to the rescue

Monday April 23 12

(btw repmgr is awesome)

Monday April 23 12

repmgr standby clone ltmastergt

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineArsquo shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineC shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

Monday April 23 12

PGBouncer abstracts moving DBs from the

app logic

Monday April 23 12

can do this as long as you have more logical shards than physical

ones

Monday April 23 12

beauty of schemas is that they are physically

different files

Monday April 23 12

(no IO hit when deleting no lsquoswiss cheesersquo)

Monday April 23 12

downside requires ~30 seconds of maintenance to roll out new schema

mapping

Monday April 23 12

(could be solved by having concept of ldquoread-

onlyrdquo mode for some DBs)

Monday April 23 12

not great for range-scans that would span across

shards

Monday April 23 12

latest project follow graph

Monday April 23 12

v1 simple DB table(source_id target_id

status)

Monday April 23 12

who do I followwho follows me

do I follow Xdoes X follow me

Monday April 23 12

DB was busy so we started storing parallel

version in Redis

Monday April 23 12

follow_all(300 item list)

Monday April 23 12

inconsistency

Monday April 23 12

extra logic

Monday April 23 12

so much extra logic

Monday April 23 12

exposing your support team to the idea of cache invalidation

Monday April 23 12

Monday April 23 12

redesign took a page from twitterrsquos book

Monday April 23 12

PG can handle tens of thousands of requests very light memcached

caching

Monday April 23 12

next steps

Monday April 23 12

isolating services to minimize open conns

Monday April 23 12

investigate physical hardware etc to reduce

need to re-shard

Monday April 23 12

Wrap up

Monday April 23 12

you donrsquot need to give up PGrsquos durability amp features to shard

Monday April 23 12

continue to let the DB do what the DB is great at

Monday April 23 12

ldquodonrsquot shard until you have tordquo

Monday April 23 12

(but donrsquot over-estimate how hard it will be either)

Monday April 23 12

scaled within constraints of the cloud

Monday April 23 12

PG success story

Monday April 23 12

(wersquore really excited about 92)

Monday April 23 12

thanks any qs

Monday April 23 12

did in Postgres itself

Monday April 23 12

requirements

Monday April 23 12

1 low operational amp code complexity

Monday April 23 12

2 easy expanding of capacity

Monday April 23 12

3 low performance impact on application

Monday April 23 12

schema-based logical sharding

Monday April 23 12

many many many (thousands) of logical

shards

Monday April 23 12

that map to fewer physical ones

Monday April 23 12

8 logical shards on 2 machines

user_id 8 = logical shard

logical shards -gt physical shard map

0 A 1 A 2 A 3 A 4 B 5 B 6 B 7 B

Monday April 23 12

8 logical shards on 2 4 machines

user_id 8 = logical shard

logical shards -gt physical shard map

0 A 1 A 2 C 3 C 4 B 5 B 6 D 7 D

Monday April 23 12

iexclschemas

Monday April 23 12

all that lsquopublicrsquo stuff Irsquod been glossing over for 2

years

Monday April 23 12

- database - schema - table - columns

Monday April 23 12

spun up set of machines

Monday April 23 12

using fabric created thousands of schemas

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineB shard4 photos_by_user shard5 photos_by_user shard6 photos_by_user shard7 photos_by_user

Monday April 23 12

(fabric or similar parallel task executor is

essential)

Monday April 23 12

application-side logic

Monday April 23 12

SHARD_TO_DB =

SHARD_TO_DB[0] = 0SHARD_TO_DB[1] = 0SHARD_TO_DB[2] = 0SHARD_TO_DB[3] = 0SHARD_TO_DB[4] = 1SHARD_TO_DB[5] = 1SHARD_TO_DB[6] = 1SHARD_TO_DB[7] = 1

Monday April 23 12

instead of Django ORM wrote really simple db

abstraction layer

Monday April 23 12

selectupdateinsertdelete

Monday April 23 12

select(fields table_name shard_key where_statement where_parameters)

Monday April 23 12

select(fields table_name shard_key where_statement where_parameters)

shard_key num_logical_shards = shard_id

Monday April 23 12

in most cases user_id for us

Monday April 23 12

custom Django test runner to createtear-down sharded DBs

Monday April 23 12

most queries involve visiting handful of shards

over one or two machines

Monday April 23 12

if mapping across shards on single DB UNION

ALL to aggregate

Monday April 23 12

clients to library pass in((shard_key id)

(shard_keyid)) etc

Monday April 23 12

library maps sub-selects to each shard and each

machine

Monday April 23 12

parallel execution (per-machine at least)

Monday April 23 12

-gt Append (cost=00097372 rows=100 width=12) (actual time=0290160035 rows=30 loops=1) -gt Limit (cost=00080624 rows=30 width=12) (actual time=0288159913 rows=14 loops=1) -gt Index Scan Backward using index on table (cost=0001865104 rows=694 width=12) (actual time=0286159885 rows=14 loops=1) -gt Limit (cost=0007115 rows=30 width=12) (actual time=00150018 rows=1 loops=1) -gt Index Scan using index on table (cost=00010199 rows=43 width=12) (actual time=00130014 rows=1 loops=1) (etc)

Monday April 23 12

eventually would be nice to parallelize across

machines

Monday April 23 12

next challenge unique IDs

Monday April 23 12

requirements

Monday April 23 12

1 should be time sortable without requiring

a lookup

Monday April 23 12

2 should be 64-bit

Monday April 23 12

3 low operational complexity

Monday April 23 12

surveyed the options

Monday April 23 12

ticket servers

Monday April 23 12

UUID

Monday April 23 12

twitter snowflake

Monday April 23 12

application-level IDs ala Mongo

Monday April 23 12

hey the db is already pretty good about

incrementing sequences

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

CREATE OR REPLACE FUNCTION insta5next_id(OUT result bigint) AS $$DECLARE our_epoch bigint = 1314220021721 seq_id bigint now_millis bigint shard_id int = 5BEGIN SELECT nextval(insta5table_id_seq) 1024 INTO seq_id

SELECT FLOOR(EXTRACT(EPOCH FROM clock_timestamp()) 1000) INTO now_millis result = (now_millis - our_epoch) ltlt 23 result = result | (shard_id ltlt 10) result = result | (seq_id)END$$ LANGUAGE PLPGSQL

Monday April 23 12

pulling shard ID from ID

shard_id = id ^ ((id gtgt 23) ltlt 23)timestamp = EPOCH + id gtgt 23

Monday April 23 12

pros guaranteed unique in 64-bits not much of a

CPU overhead

Monday April 23 12

cons large IDs from the get-go

Monday April 23 12

hundreds of millions of IDs generated with this

scheme no issues

Monday April 23 12

well what about ldquore-shardingrdquo

Monday April 23 12

first recourse pg_reorg

Monday April 23 12

rewrites tables in index order

Monday April 23 12

only requires brief locks for atomic table renames

Monday April 23 12

20+GB savings on some of our dbs

Monday April 23 12

Monday April 23 12

especially useful on EC2

Monday April 23 12

but sometimes you just have to reshard

Monday April 23 12

streaming replication to the rescue

Monday April 23 12

(btw repmgr is awesome)

Monday April 23 12

repmgr standby clone ltmastergt

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineArsquo shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineC shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

Monday April 23 12

PGBouncer abstracts moving DBs from the

app logic

Monday April 23 12

can do this as long as you have more logical shards than physical

ones

Monday April 23 12

beauty of schemas is that they are physically

different files

Monday April 23 12

(no IO hit when deleting no lsquoswiss cheesersquo)

Monday April 23 12

downside requires ~30 seconds of maintenance to roll out new schema

mapping

Monday April 23 12

(could be solved by having concept of ldquoread-

onlyrdquo mode for some DBs)

Monday April 23 12

not great for range-scans that would span across

shards

Monday April 23 12

latest project follow graph

Monday April 23 12

v1 simple DB table(source_id target_id

status)

Monday April 23 12

who do I followwho follows me

do I follow Xdoes X follow me

Monday April 23 12

DB was busy so we started storing parallel

version in Redis

Monday April 23 12

follow_all(300 item list)

Monday April 23 12

inconsistency

Monday April 23 12

extra logic

Monday April 23 12

so much extra logic

Monday April 23 12

exposing your support team to the idea of cache invalidation

Monday April 23 12

Monday April 23 12

redesign took a page from twitterrsquos book

Monday April 23 12

PG can handle tens of thousands of requests very light memcached

caching

Monday April 23 12

next steps

Monday April 23 12

isolating services to minimize open conns

Monday April 23 12

investigate physical hardware etc to reduce

need to re-shard

Monday April 23 12

Wrap up

Monday April 23 12

you donrsquot need to give up PGrsquos durability amp features to shard

Monday April 23 12

continue to let the DB do what the DB is great at

Monday April 23 12

ldquodonrsquot shard until you have tordquo

Monday April 23 12

(but donrsquot over-estimate how hard it will be either)

Monday April 23 12

scaled within constraints of the cloud

Monday April 23 12

PG success story

Monday April 23 12

(wersquore really excited about 92)

Monday April 23 12

thanks any qs

Monday April 23 12

requirements

Monday April 23 12

1 low operational amp code complexity

Monday April 23 12

2 easy expanding of capacity

Monday April 23 12

3 low performance impact on application

Monday April 23 12

schema-based logical sharding

Monday April 23 12

many many many (thousands) of logical

shards

Monday April 23 12

that map to fewer physical ones

Monday April 23 12

8 logical shards on 2 machines

user_id 8 = logical shard

logical shards -gt physical shard map

0 A 1 A 2 A 3 A 4 B 5 B 6 B 7 B

Monday April 23 12

8 logical shards on 2 4 machines

user_id 8 = logical shard

logical shards -gt physical shard map

0 A 1 A 2 C 3 C 4 B 5 B 6 D 7 D

Monday April 23 12

iexclschemas

Monday April 23 12

all that lsquopublicrsquo stuff Irsquod been glossing over for 2

years

Monday April 23 12

- database - schema - table - columns

Monday April 23 12

spun up set of machines

Monday April 23 12

using fabric created thousands of schemas

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineB shard4 photos_by_user shard5 photos_by_user shard6 photos_by_user shard7 photos_by_user

Monday April 23 12

(fabric or similar parallel task executor is

essential)

Monday April 23 12

application-side logic

Monday April 23 12

SHARD_TO_DB =

SHARD_TO_DB[0] = 0SHARD_TO_DB[1] = 0SHARD_TO_DB[2] = 0SHARD_TO_DB[3] = 0SHARD_TO_DB[4] = 1SHARD_TO_DB[5] = 1SHARD_TO_DB[6] = 1SHARD_TO_DB[7] = 1

Monday April 23 12

instead of Django ORM wrote really simple db

abstraction layer

Monday April 23 12

selectupdateinsertdelete

Monday April 23 12

select(fields table_name shard_key where_statement where_parameters)

Monday April 23 12

select(fields table_name shard_key where_statement where_parameters)

shard_key num_logical_shards = shard_id

Monday April 23 12

in most cases user_id for us

Monday April 23 12

custom Django test runner to createtear-down sharded DBs

Monday April 23 12

most queries involve visiting handful of shards

over one or two machines

Monday April 23 12

if mapping across shards on single DB UNION

ALL to aggregate

Monday April 23 12

clients to library pass in((shard_key id)

(shard_keyid)) etc

Monday April 23 12

library maps sub-selects to each shard and each

machine

Monday April 23 12

parallel execution (per-machine at least)

Monday April 23 12

-gt Append (cost=00097372 rows=100 width=12) (actual time=0290160035 rows=30 loops=1) -gt Limit (cost=00080624 rows=30 width=12) (actual time=0288159913 rows=14 loops=1) -gt Index Scan Backward using index on table (cost=0001865104 rows=694 width=12) (actual time=0286159885 rows=14 loops=1) -gt Limit (cost=0007115 rows=30 width=12) (actual time=00150018 rows=1 loops=1) -gt Index Scan using index on table (cost=00010199 rows=43 width=12) (actual time=00130014 rows=1 loops=1) (etc)

Monday April 23 12

eventually would be nice to parallelize across

machines

Monday April 23 12

next challenge unique IDs

Monday April 23 12

requirements

Monday April 23 12

1 should be time sortable without requiring

a lookup

Monday April 23 12

2 should be 64-bit

Monday April 23 12

3 low operational complexity

Monday April 23 12

surveyed the options

Monday April 23 12

ticket servers

Monday April 23 12

UUID

Monday April 23 12

twitter snowflake

Monday April 23 12

application-level IDs ala Mongo

Monday April 23 12

hey the db is already pretty good about

incrementing sequences

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

CREATE OR REPLACE FUNCTION insta5next_id(OUT result bigint) AS $$DECLARE our_epoch bigint = 1314220021721 seq_id bigint now_millis bigint shard_id int = 5BEGIN SELECT nextval(insta5table_id_seq) 1024 INTO seq_id

SELECT FLOOR(EXTRACT(EPOCH FROM clock_timestamp()) 1000) INTO now_millis result = (now_millis - our_epoch) ltlt 23 result = result | (shard_id ltlt 10) result = result | (seq_id)END$$ LANGUAGE PLPGSQL

Monday April 23 12

pulling shard ID from ID

shard_id = id ^ ((id gtgt 23) ltlt 23)timestamp = EPOCH + id gtgt 23

Monday April 23 12

pros guaranteed unique in 64-bits not much of a

CPU overhead

Monday April 23 12

cons large IDs from the get-go

Monday April 23 12

hundreds of millions of IDs generated with this

scheme no issues

Monday April 23 12

well what about ldquore-shardingrdquo

Monday April 23 12

first recourse pg_reorg

Monday April 23 12

rewrites tables in index order

Monday April 23 12

only requires brief locks for atomic table renames

Monday April 23 12

20+GB savings on some of our dbs

Monday April 23 12

Monday April 23 12

especially useful on EC2

Monday April 23 12

but sometimes you just have to reshard

Monday April 23 12

streaming replication to the rescue

Monday April 23 12

(btw repmgr is awesome)

Monday April 23 12

repmgr standby clone ltmastergt

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineArsquo shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineC shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

Monday April 23 12

PGBouncer abstracts moving DBs from the

app logic

Monday April 23 12

can do this as long as you have more logical shards than physical

ones

Monday April 23 12

beauty of schemas is that they are physically

different files

Monday April 23 12

(no IO hit when deleting no lsquoswiss cheesersquo)

Monday April 23 12

downside requires ~30 seconds of maintenance to roll out new schema

mapping

Monday April 23 12

(could be solved by having concept of ldquoread-

onlyrdquo mode for some DBs)

Monday April 23 12

not great for range-scans that would span across

shards

Monday April 23 12

latest project follow graph

Monday April 23 12

v1 simple DB table(source_id target_id

status)

Monday April 23 12

who do I followwho follows me

do I follow Xdoes X follow me

Monday April 23 12

DB was busy so we started storing parallel

version in Redis

Monday April 23 12

follow_all(300 item list)

Monday April 23 12

inconsistency

Monday April 23 12

extra logic

Monday April 23 12

so much extra logic

Monday April 23 12

exposing your support team to the idea of cache invalidation

Monday April 23 12

Monday April 23 12

redesign took a page from twitterrsquos book

Monday April 23 12

PG can handle tens of thousands of requests very light memcached

caching

Monday April 23 12

next steps

Monday April 23 12

isolating services to minimize open conns

Monday April 23 12

investigate physical hardware etc to reduce

need to re-shard

Monday April 23 12

Wrap up

Monday April 23 12

you donrsquot need to give up PGrsquos durability amp features to shard

Monday April 23 12

continue to let the DB do what the DB is great at

Monday April 23 12

ldquodonrsquot shard until you have tordquo

Monday April 23 12

(but donrsquot over-estimate how hard it will be either)

Monday April 23 12

scaled within constraints of the cloud

Monday April 23 12

PG success story

Monday April 23 12

(wersquore really excited about 92)

Monday April 23 12

thanks any qs

Monday April 23 12

1 low operational amp code complexity

Monday April 23 12

2 easy expanding of capacity

Monday April 23 12

3 low performance impact on application

Monday April 23 12

schema-based logical sharding

Monday April 23 12

many many many (thousands) of logical

shards

Monday April 23 12

that map to fewer physical ones

Monday April 23 12

8 logical shards on 2 machines

user_id 8 = logical shard

logical shards -gt physical shard map

0 A 1 A 2 A 3 A 4 B 5 B 6 B 7 B

Monday April 23 12

8 logical shards on 2 4 machines

user_id 8 = logical shard

logical shards -gt physical shard map

0 A 1 A 2 C 3 C 4 B 5 B 6 D 7 D

Monday April 23 12

iexclschemas

Monday April 23 12

all that lsquopublicrsquo stuff Irsquod been glossing over for 2

years

Monday April 23 12

- database - schema - table - columns

Monday April 23 12

spun up set of machines

Monday April 23 12

using fabric created thousands of schemas

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineB shard4 photos_by_user shard5 photos_by_user shard6 photos_by_user shard7 photos_by_user

Monday April 23 12

(fabric or similar parallel task executor is

essential)

Monday April 23 12

application-side logic

Monday April 23 12

SHARD_TO_DB =

SHARD_TO_DB[0] = 0SHARD_TO_DB[1] = 0SHARD_TO_DB[2] = 0SHARD_TO_DB[3] = 0SHARD_TO_DB[4] = 1SHARD_TO_DB[5] = 1SHARD_TO_DB[6] = 1SHARD_TO_DB[7] = 1

Monday April 23 12

instead of Django ORM wrote really simple db

abstraction layer

Monday April 23 12

selectupdateinsertdelete

Monday April 23 12

select(fields table_name shard_key where_statement where_parameters)

Monday April 23 12

select(fields table_name shard_key where_statement where_parameters)

shard_key num_logical_shards = shard_id

Monday April 23 12

in most cases user_id for us

Monday April 23 12

custom Django test runner to createtear-down sharded DBs

Monday April 23 12

most queries involve visiting handful of shards

over one or two machines

Monday April 23 12

if mapping across shards on single DB UNION

ALL to aggregate

Monday April 23 12

clients to library pass in((shard_key id)

(shard_keyid)) etc

Monday April 23 12

library maps sub-selects to each shard and each

machine

Monday April 23 12

parallel execution (per-machine at least)

Monday April 23 12

->  Append  (cost=0.00..973.72 rows=100 width=12) (actual time=0.290..160.035 rows=30 loops=1)
      ->  Limit  (cost=0.00..806.24 rows=30 width=12) (actual time=0.288..159.913 rows=14 loops=1)
            ->  Index Scan Backward using index on table  (cost=0.00..18651.04 rows=694 width=12) (actual time=0.286..159.885 rows=14 loops=1)
      ->  Limit  (cost=0.00..71.15 rows=30 width=12) (actual time=0.015..0.018 rows=1 loops=1)
            ->  Index Scan using index on table  (cost=0.00..101.99 rows=43 width=12) (actual time=0.013..0.014 rows=1 loops=1)
(etc.)

Monday April 23 12

eventually would be nice to parallelize across machines

Monday April 23 12

next challenge: unique IDs

Monday April 23 12

requirements

Monday April 23 12

1. should be time-sortable without requiring a lookup

Monday April 23 12

2. should be 64-bit

Monday April 23 12

3. low operational complexity

Monday April 23 12

surveyed the options

Monday April 23 12

ticket servers

Monday April 23 12

UUID

Monday April 23 12

twitter snowflake

Monday April 23 12

application-level IDs à la Mongo

Monday April 23 12

hey, the db is already pretty good about incrementing sequences

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

CREATE OR REPLACE FUNCTION insta5.next_id(OUT result bigint) AS $$
DECLARE
    our_epoch bigint := 1314220021721;
    seq_id bigint;
    now_millis bigint;
    shard_id int := 5;
BEGIN
    SELECT nextval('insta5.table_id_seq') % 1024 INTO seq_id;
    SELECT FLOOR(EXTRACT(EPOCH FROM clock_timestamp()) * 1000) INTO now_millis;
    result := (now_millis - our_epoch) << 23;
    result := result | (shard_id << 10);
    result := result | (seq_id);
END;
$$ LANGUAGE PLPGSQL;

Monday April 23 12
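
(presumably the generator gets wired in as a column default on each sharded table; the table and connection details below are illustrative, not from the deck)

import psycopg2

conn = psycopg2.connect("host=machineA dbname=insta")   # placeholder connection details
cur = conn.cursor()
cur.execute("""
    CREATE TABLE insta5.our_table (
        id bigint NOT NULL DEFAULT insta5.next_id(),
        user_id bigint NOT NULL
    )""")
conn.commit()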

pulling shard ID from ID

shard_id = id ^ ((id >> 23) << 23)
timestamp = EPOCH + id >> 23

Monday April 23 12
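
(the same math in Python; one step the slide leaves implicit is that the low 23 bits hold both the 13-bit shard ID and the 10-bit sequence, so shift right by 10 to isolate the shard)

OUR_EPOCH = 1314220021721   # milliseconds, same custom epoch as in next_id()

def decode(photo_id):
    low_bits = photo_id ^ ((photo_id >> 23) << 23)   # the slide's formula: bottom 23 bits
    shard_id = low_bits >> 10                         # 13-bit shard ID
    seq_id = low_bits & 1023                          # 10-bit sequence
    millis = OUR_EPOCH + (photo_id >> 23)             # creation time in unix millis
    return shard_id, seq_id, millis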

pros: guaranteed unique in 64 bits, not much of a CPU overhead

Monday April 23 12

cons: large IDs from the get-go

Monday April 23 12

hundreds of millions of IDs generated with this scheme, no issues

Monday April 23 12

well, what about "re-sharding"?

Monday April 23 12

first recourse: pg_reorg

Monday April 23 12

rewrites tables in index order

Monday April 23 12

only requires brief locks for atomic table renames

Monday April 23 12

20+GB savings on some of our dbs

Monday April 23 12

Monday April 23 12

especially useful on EC2

Monday April 23 12

but sometimes you just have to reshard

Monday April 23 12

streaming replication to the rescue

Monday April 23 12

(btw repmgr is awesome)

Monday April 23 12

repmgr standby clone <master>

Monday April 23 12

machineA:
  shard0.photos_by_user
  shard1.photos_by_user
  shard2.photos_by_user
  shard3.photos_by_user

machineA':
  shard0.photos_by_user
  shard1.photos_by_user
  shard2.photos_by_user
  shard3.photos_by_user

Monday April 23 12

machineA:
  shard0.photos_by_user
  shard1.photos_by_user
  shard2.photos_by_user
  shard3.photos_by_user

machineC:
  shard0.photos_by_user
  shard1.photos_by_user
  shard2.photos_by_user
  shard3.photos_by_user

Monday April 23 12

PGBouncer abstracts moving DBs from the app logic

Monday April 23 12

can do this as long as you have more logical shards than physical ones

Monday April 23 12
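
(a sketch of what the cutover amounts to on the application side, reusing the illustrative letter-keyed map from the earlier sketch)

# before: shards 0-3 on machine A, 4-7 on machine B
LOGICAL_TO_PHYSICAL = {0: 'A', 1: 'A', 2: 'A', 3: 'A',
                       4: 'B', 5: 'B', 6: 'B', 7: 'B'}

# 1. `repmgr standby clone <master>` gives new machine C a full copy of A,
#    including all four of A's schemas
# 2. promote C, then repoint PGBouncer / the shard map so C serves half of A's shards
LOGICAL_TO_PHYSICAL.update({2: 'C', 3: 'C'})

# 3. later, drop the schemas each box no longer serves (A: shard2/3, C: shard0/1)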

beauty of schemas is that they are physically different files

Monday April 23 12

(no I/O hit when deleting, no 'swiss cheese')

Monday April 23 12

downside: requires ~30 seconds of maintenance to roll out new schema mapping

Monday April 23 12

(could be solved by having concept of "read-only" mode for some DBs)

Monday April 23 12

not great for range-scans that would span across shards

Monday April 23 12

latest project: follow graph

Monday April 23 12

v1: simple DB table (source_id, target_id, status)

Monday April 23 12

who do I follow? who follows me?

do I follow X? does X follow me?

Monday April 23 12

DB was busy, so we started storing a parallel version in Redis

Monday April 23 12

follow_all(300-item list)

Monday April 23 12

inconsistency

Monday April 23 12

extra logic

Monday April 23 12

so much extra logic

Monday April 23 12

exposing your support team to the idea of cache invalidation

Monday April 23 12

Monday April 23 12

redesign took a page from twitter's book

Monday April 23 12

PG can handle tens of thousands of requests, very light memcached caching

Monday April 23 12

next steps

Monday April 23 12

isolating services to minimize open conns

Monday April 23 12

investigate physical hardware etc. to reduce need to re-shard

Monday April 23 12

Wrap up

Monday April 23 12

you don't need to give up PG's durability & features to shard

Monday April 23 12

continue to let the DB do what the DB is great at

Monday April 23 12

"don't shard until you have to"

Monday April 23 12

(but don't over-estimate how hard it will be either)

Monday April 23 12

scaled within constraints of the cloud

Monday April 23 12

PG success story

Monday April 23 12

(we're really excited about 9.2)

Monday April 23 12

thanks! any q's?

Monday April 23 12

2 easy expanding of capacity

Monday April 23 12

3 low performance impact on application

Monday April 23 12

schema-based logical sharding

Monday April 23 12

many many many (thousands) of logical

shards

Monday April 23 12

that map to fewer physical ones

Monday April 23 12

8 logical shards on 2 machines

user_id 8 = logical shard

logical shards -gt physical shard map

0 A 1 A 2 A 3 A 4 B 5 B 6 B 7 B

Monday April 23 12

8 logical shards on 2 4 machines

user_id 8 = logical shard

logical shards -gt physical shard map

0 A 1 A 2 C 3 C 4 B 5 B 6 D 7 D

Monday April 23 12

iexclschemas

Monday April 23 12

all that lsquopublicrsquo stuff Irsquod been glossing over for 2

years

Monday April 23 12

- database - schema - table - columns

Monday April 23 12

spun up set of machines

Monday April 23 12

using fabric created thousands of schemas

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineB shard4 photos_by_user shard5 photos_by_user shard6 photos_by_user shard7 photos_by_user

Monday April 23 12

(fabric or similar parallel task executor is

essential)

Monday April 23 12

application-side logic

Monday April 23 12

SHARD_TO_DB =

SHARD_TO_DB[0] = 0SHARD_TO_DB[1] = 0SHARD_TO_DB[2] = 0SHARD_TO_DB[3] = 0SHARD_TO_DB[4] = 1SHARD_TO_DB[5] = 1SHARD_TO_DB[6] = 1SHARD_TO_DB[7] = 1

Monday April 23 12

instead of Django ORM wrote really simple db

abstraction layer

Monday April 23 12

selectupdateinsertdelete

Monday April 23 12

select(fields table_name shard_key where_statement where_parameters)

Monday April 23 12

select(fields table_name shard_key where_statement where_parameters)

shard_key num_logical_shards = shard_id

Monday April 23 12

in most cases user_id for us

Monday April 23 12

custom Django test runner to createtear-down sharded DBs

Monday April 23 12

most queries involve visiting handful of shards

over one or two machines

Monday April 23 12

if mapping across shards on single DB UNION

ALL to aggregate

Monday April 23 12

clients to library pass in((shard_key id)

(shard_keyid)) etc

Monday April 23 12

library maps sub-selects to each shard and each

machine

Monday April 23 12

parallel execution (per-machine at least)

Monday April 23 12

-gt Append (cost=00097372 rows=100 width=12) (actual time=0290160035 rows=30 loops=1) -gt Limit (cost=00080624 rows=30 width=12) (actual time=0288159913 rows=14 loops=1) -gt Index Scan Backward using index on table (cost=0001865104 rows=694 width=12) (actual time=0286159885 rows=14 loops=1) -gt Limit (cost=0007115 rows=30 width=12) (actual time=00150018 rows=1 loops=1) -gt Index Scan using index on table (cost=00010199 rows=43 width=12) (actual time=00130014 rows=1 loops=1) (etc)

Monday April 23 12

eventually would be nice to parallelize across

machines

Monday April 23 12

next challenge unique IDs

Monday April 23 12

requirements

Monday April 23 12

1 should be time sortable without requiring

a lookup

Monday April 23 12

2 should be 64-bit

Monday April 23 12

3 low operational complexity

Monday April 23 12

surveyed the options

Monday April 23 12

ticket servers

Monday April 23 12

UUID

Monday April 23 12

twitter snowflake

Monday April 23 12

application-level IDs ala Mongo

Monday April 23 12

hey the db is already pretty good about

incrementing sequences

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

CREATE OR REPLACE FUNCTION insta5next_id(OUT result bigint) AS $$DECLARE our_epoch bigint = 1314220021721 seq_id bigint now_millis bigint shard_id int = 5BEGIN SELECT nextval(insta5table_id_seq) 1024 INTO seq_id

SELECT FLOOR(EXTRACT(EPOCH FROM clock_timestamp()) 1000) INTO now_millis result = (now_millis - our_epoch) ltlt 23 result = result | (shard_id ltlt 10) result = result | (seq_id)END$$ LANGUAGE PLPGSQL

Monday April 23 12

pulling shard ID from ID

shard_id = id ^ ((id gtgt 23) ltlt 23)timestamp = EPOCH + id gtgt 23

Monday April 23 12

pros guaranteed unique in 64-bits not much of a

CPU overhead

Monday April 23 12

cons large IDs from the get-go

Monday April 23 12

hundreds of millions of IDs generated with this

scheme no issues

Monday April 23 12

well what about ldquore-shardingrdquo

Monday April 23 12

first recourse pg_reorg

Monday April 23 12

rewrites tables in index order

Monday April 23 12

only requires brief locks for atomic table renames

Monday April 23 12

20+GB savings on some of our dbs

Monday April 23 12

Monday April 23 12

especially useful on EC2

Monday April 23 12

but sometimes you just have to reshard

Monday April 23 12

streaming replication to the rescue

Monday April 23 12

(btw repmgr is awesome)

Monday April 23 12

repmgr standby clone ltmastergt

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineArsquo shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineC shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

Monday April 23 12

PGBouncer abstracts moving DBs from the

app logic

Monday April 23 12

can do this as long as you have more logical shards than physical

ones

Monday April 23 12

beauty of schemas is that they are physically

different files

Monday April 23 12

(no IO hit when deleting no lsquoswiss cheesersquo)

Monday April 23 12

downside requires ~30 seconds of maintenance to roll out new schema

mapping

Monday April 23 12

(could be solved by having concept of ldquoread-

onlyrdquo mode for some DBs)

Monday April 23 12

not great for range-scans that would span across

shards

Monday April 23 12

latest project follow graph

Monday April 23 12

v1 simple DB table(source_id target_id

status)

Monday April 23 12

who do I followwho follows me

do I follow Xdoes X follow me

Monday April 23 12

DB was busy so we started storing parallel

version in Redis

Monday April 23 12

follow_all(300 item list)

Monday April 23 12

inconsistency

Monday April 23 12

extra logic

Monday April 23 12

so much extra logic

Monday April 23 12

exposing your support team to the idea of cache invalidation

Monday April 23 12

Monday April 23 12

redesign took a page from twitterrsquos book

Monday April 23 12

PG can handle tens of thousands of requests very light memcached

caching

Monday April 23 12

next steps

Monday April 23 12

isolating services to minimize open conns

Monday April 23 12

investigate physical hardware etc to reduce

need to re-shard

Monday April 23 12

Wrap up

Monday April 23 12

you donrsquot need to give up PGrsquos durability amp features to shard

Monday April 23 12

continue to let the DB do what the DB is great at

Monday April 23 12

ldquodonrsquot shard until you have tordquo

Monday April 23 12

(but donrsquot over-estimate how hard it will be either)

Monday April 23 12

scaled within constraints of the cloud

Monday April 23 12

PG success story

Monday April 23 12

(wersquore really excited about 92)

Monday April 23 12

thanks any qs

Monday April 23 12

3 low performance impact on application

Monday April 23 12

schema-based logical sharding

Monday April 23 12

many many many (thousands) of logical

shards

Monday April 23 12

that map to fewer physical ones

Monday April 23 12

8 logical shards on 2 machines

user_id 8 = logical shard

logical shards -gt physical shard map

0 A 1 A 2 A 3 A 4 B 5 B 6 B 7 B

Monday April 23 12

8 logical shards on 2 4 machines

user_id 8 = logical shard

logical shards -gt physical shard map

0 A 1 A 2 C 3 C 4 B 5 B 6 D 7 D

Monday April 23 12

iexclschemas

Monday April 23 12

all that lsquopublicrsquo stuff Irsquod been glossing over for 2

years

Monday April 23 12

- database - schema - table - columns

Monday April 23 12

spun up set of machines

Monday April 23 12

using fabric created thousands of schemas

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineB shard4 photos_by_user shard5 photos_by_user shard6 photos_by_user shard7 photos_by_user

Monday April 23 12

(fabric or similar parallel task executor is

essential)

Monday April 23 12

application-side logic

Monday April 23 12

SHARD_TO_DB =

SHARD_TO_DB[0] = 0SHARD_TO_DB[1] = 0SHARD_TO_DB[2] = 0SHARD_TO_DB[3] = 0SHARD_TO_DB[4] = 1SHARD_TO_DB[5] = 1SHARD_TO_DB[6] = 1SHARD_TO_DB[7] = 1

Monday April 23 12

instead of Django ORM wrote really simple db

abstraction layer

Monday April 23 12

selectupdateinsertdelete

Monday April 23 12

select(fields table_name shard_key where_statement where_parameters)

Monday April 23 12

select(fields table_name shard_key where_statement where_parameters)

shard_key num_logical_shards = shard_id

Monday April 23 12

in most cases user_id for us

Monday April 23 12

custom Django test runner to createtear-down sharded DBs

Monday April 23 12

most queries involve visiting handful of shards

over one or two machines

Monday April 23 12

if mapping across shards on single DB UNION

ALL to aggregate

Monday April 23 12

clients to library pass in((shard_key id)

(shard_keyid)) etc

Monday April 23 12

library maps sub-selects to each shard and each

machine

Monday April 23 12

parallel execution (per-machine at least)

Monday April 23 12

-gt Append (cost=00097372 rows=100 width=12) (actual time=0290160035 rows=30 loops=1) -gt Limit (cost=00080624 rows=30 width=12) (actual time=0288159913 rows=14 loops=1) -gt Index Scan Backward using index on table (cost=0001865104 rows=694 width=12) (actual time=0286159885 rows=14 loops=1) -gt Limit (cost=0007115 rows=30 width=12) (actual time=00150018 rows=1 loops=1) -gt Index Scan using index on table (cost=00010199 rows=43 width=12) (actual time=00130014 rows=1 loops=1) (etc)

Monday April 23 12

eventually would be nice to parallelize across

machines

Monday April 23 12

next challenge unique IDs

Monday April 23 12

requirements

Monday April 23 12

1 should be time sortable without requiring

a lookup

Monday April 23 12

2 should be 64-bit

Monday April 23 12

3 low operational complexity

Monday April 23 12

surveyed the options

Monday April 23 12

ticket servers

Monday April 23 12

UUID

Monday April 23 12

twitter snowflake

Monday April 23 12

application-level IDs ala Mongo

Monday April 23 12

hey the db is already pretty good about

incrementing sequences

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

CREATE OR REPLACE FUNCTION insta5next_id(OUT result bigint) AS $$DECLARE our_epoch bigint = 1314220021721 seq_id bigint now_millis bigint shard_id int = 5BEGIN SELECT nextval(insta5table_id_seq) 1024 INTO seq_id

SELECT FLOOR(EXTRACT(EPOCH FROM clock_timestamp()) 1000) INTO now_millis result = (now_millis - our_epoch) ltlt 23 result = result | (shard_id ltlt 10) result = result | (seq_id)END$$ LANGUAGE PLPGSQL

Monday April 23 12

pulling shard ID from ID

shard_id = id ^ ((id gtgt 23) ltlt 23)timestamp = EPOCH + id gtgt 23

Monday April 23 12

pros guaranteed unique in 64-bits not much of a

CPU overhead

Monday April 23 12

cons large IDs from the get-go

Monday April 23 12

hundreds of millions of IDs generated with this

scheme no issues

Monday April 23 12

well what about ldquore-shardingrdquo

Monday April 23 12

first recourse pg_reorg

Monday April 23 12

rewrites tables in index order

Monday April 23 12

only requires brief locks for atomic table renames

Monday April 23 12

20+GB savings on some of our dbs

Monday April 23 12

Monday April 23 12

especially useful on EC2

Monday April 23 12

but sometimes you just have to reshard

Monday April 23 12

streaming replication to the rescue

Monday April 23 12

(btw repmgr is awesome)

Monday April 23 12

repmgr standby clone ltmastergt

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineArsquo shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineC shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

Monday April 23 12

PGBouncer abstracts moving DBs from the

app logic

Monday April 23 12

can do this as long as you have more logical shards than physical

ones

Monday April 23 12

beauty of schemas is that they are physically

different files

Monday April 23 12

(no IO hit when deleting no lsquoswiss cheesersquo)

Monday April 23 12

downside requires ~30 seconds of maintenance to roll out new schema

mapping

Monday April 23 12

(could be solved by having concept of ldquoread-

onlyrdquo mode for some DBs)

Monday April 23 12

not great for range-scans that would span across

shards

Monday April 23 12

latest project follow graph

Monday April 23 12

v1 simple DB table(source_id target_id

status)

Monday April 23 12

who do I followwho follows me

do I follow Xdoes X follow me

Monday April 23 12

DB was busy so we started storing parallel

version in Redis

Monday April 23 12

follow_all(300 item list)

Monday April 23 12

inconsistency

Monday April 23 12

extra logic

Monday April 23 12

so much extra logic

Monday April 23 12

exposing your support team to the idea of cache invalidation

Monday April 23 12

Monday April 23 12

redesign took a page from twitterrsquos book

Monday April 23 12

PG can handle tens of thousands of requests very light memcached

caching

Monday April 23 12

next steps

Monday April 23 12

isolating services to minimize open conns

Monday April 23 12

investigate physical hardware etc to reduce

need to re-shard

Monday April 23 12

Wrap up

Monday April 23 12

you donrsquot need to give up PGrsquos durability amp features to shard

Monday April 23 12

continue to let the DB do what the DB is great at

Monday April 23 12

ldquodonrsquot shard until you have tordquo

Monday April 23 12

(but donrsquot over-estimate how hard it will be either)

Monday April 23 12

scaled within constraints of the cloud

Monday April 23 12

PG success story

Monday April 23 12

(wersquore really excited about 92)

Monday April 23 12

thanks any qs

Monday April 23 12

schema-based logical sharding

Monday April 23 12

many many many (thousands) of logical

shards

Monday April 23 12

that map to fewer physical ones

Monday April 23 12

8 logical shards on 2 machines

user_id 8 = logical shard

logical shards -gt physical shard map

0 A 1 A 2 A 3 A 4 B 5 B 6 B 7 B

Monday April 23 12

8 logical shards on 2 4 machines

user_id 8 = logical shard

logical shards -gt physical shard map

0 A 1 A 2 C 3 C 4 B 5 B 6 D 7 D

Monday April 23 12

iexclschemas

Monday April 23 12

all that lsquopublicrsquo stuff Irsquod been glossing over for 2

years

Monday April 23 12

- database - schema - table - columns

Monday April 23 12

spun up set of machines

Monday April 23 12

using fabric created thousands of schemas

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineB shard4 photos_by_user shard5 photos_by_user shard6 photos_by_user shard7 photos_by_user

Monday April 23 12

(fabric or similar parallel task executor is

essential)

Monday April 23 12

application-side logic

Monday April 23 12

SHARD_TO_DB =

SHARD_TO_DB[0] = 0SHARD_TO_DB[1] = 0SHARD_TO_DB[2] = 0SHARD_TO_DB[3] = 0SHARD_TO_DB[4] = 1SHARD_TO_DB[5] = 1SHARD_TO_DB[6] = 1SHARD_TO_DB[7] = 1

Monday April 23 12

instead of Django ORM wrote really simple db

abstraction layer

Monday April 23 12

selectupdateinsertdelete

Monday April 23 12

select(fields table_name shard_key where_statement where_parameters)

Monday April 23 12

select(fields table_name shard_key where_statement where_parameters)

shard_key num_logical_shards = shard_id

Monday April 23 12

in most cases user_id for us

Monday April 23 12

custom Django test runner to createtear-down sharded DBs

Monday April 23 12

most queries involve visiting handful of shards

over one or two machines

Monday April 23 12

if mapping across shards on single DB UNION

ALL to aggregate

Monday April 23 12

clients to library pass in((shard_key id)

(shard_keyid)) etc

Monday April 23 12

library maps sub-selects to each shard and each

machine

Monday April 23 12

parallel execution (per-machine at least)

Monday April 23 12

-gt Append (cost=00097372 rows=100 width=12) (actual time=0290160035 rows=30 loops=1) -gt Limit (cost=00080624 rows=30 width=12) (actual time=0288159913 rows=14 loops=1) -gt Index Scan Backward using index on table (cost=0001865104 rows=694 width=12) (actual time=0286159885 rows=14 loops=1) -gt Limit (cost=0007115 rows=30 width=12) (actual time=00150018 rows=1 loops=1) -gt Index Scan using index on table (cost=00010199 rows=43 width=12) (actual time=00130014 rows=1 loops=1) (etc)

Monday April 23 12

eventually would be nice to parallelize across

machines

Monday April 23 12

next challenge unique IDs

Monday April 23 12

requirements

Monday April 23 12

1 should be time sortable without requiring

a lookup

Monday April 23 12

2 should be 64-bit

Monday April 23 12

3 low operational complexity

Monday April 23 12

surveyed the options

Monday April 23 12

ticket servers

Monday April 23 12

UUID

Monday April 23 12

twitter snowflake

Monday April 23 12

application-level IDs ala Mongo

Monday April 23 12

hey the db is already pretty good about

incrementing sequences

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

CREATE OR REPLACE FUNCTION insta5next_id(OUT result bigint) AS $$DECLARE our_epoch bigint = 1314220021721 seq_id bigint now_millis bigint shard_id int = 5BEGIN SELECT nextval(insta5table_id_seq) 1024 INTO seq_id

SELECT FLOOR(EXTRACT(EPOCH FROM clock_timestamp()) 1000) INTO now_millis result = (now_millis - our_epoch) ltlt 23 result = result | (shard_id ltlt 10) result = result | (seq_id)END$$ LANGUAGE PLPGSQL

Monday April 23 12

pulling shard ID from ID

shard_id = id ^ ((id gtgt 23) ltlt 23)timestamp = EPOCH + id gtgt 23

Monday April 23 12

pros guaranteed unique in 64-bits not much of a

CPU overhead

Monday April 23 12

cons large IDs from the get-go

Monday April 23 12

hundreds of millions of IDs generated with this

scheme no issues

Monday April 23 12

well what about ldquore-shardingrdquo

Monday April 23 12

first recourse pg_reorg

Monday April 23 12

rewrites tables in index order

Monday April 23 12

only requires brief locks for atomic table renames

Monday April 23 12

20+GB savings on some of our dbs

Monday April 23 12

Monday April 23 12

especially useful on EC2

Monday April 23 12

but sometimes you just have to reshard

Monday April 23 12

streaming replication to the rescue

Monday April 23 12

(btw repmgr is awesome)

Monday April 23 12

repmgr standby clone ltmastergt

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineArsquo shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineC shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

Monday April 23 12

PGBouncer abstracts moving DBs from the

app logic

Monday April 23 12

can do this as long as you have more logical shards than physical

ones

Monday April 23 12

beauty of schemas is that they are physically

different files

Monday April 23 12

(no IO hit when deleting no lsquoswiss cheesersquo)

Monday April 23 12

downside requires ~30 seconds of maintenance to roll out new schema

mapping

Monday April 23 12

(could be solved by having concept of ldquoread-

onlyrdquo mode for some DBs)

Monday April 23 12

not great for range-scans that would span across

shards

Monday April 23 12

latest project follow graph

Monday April 23 12

v1 simple DB table(source_id target_id

status)

Monday April 23 12

who do I followwho follows me

do I follow Xdoes X follow me

Monday April 23 12

DB was busy so we started storing parallel

version in Redis

Monday April 23 12

follow_all(300 item list)

Monday April 23 12

inconsistency

Monday April 23 12

extra logic

Monday April 23 12

so much extra logic

Monday April 23 12

exposing your support team to the idea of cache invalidation

Monday April 23 12

Monday April 23 12

redesign took a page from twitterrsquos book

Monday April 23 12

PG can handle tens of thousands of requests very light memcached

caching

Monday April 23 12

next steps

Monday April 23 12

isolating services to minimize open conns

Monday April 23 12

investigate physical hardware etc to reduce

need to re-shard

Monday April 23 12

Wrap up

Monday April 23 12

you donrsquot need to give up PGrsquos durability amp features to shard

Monday April 23 12

continue to let the DB do what the DB is great at

Monday April 23 12

ldquodonrsquot shard until you have tordquo

Monday April 23 12

(but donrsquot over-estimate how hard it will be either)

Monday April 23 12

scaled within constraints of the cloud

Monday April 23 12

PG success story

Monday April 23 12

(wersquore really excited about 92)

Monday April 23 12

thanks any qs

Monday April 23 12

many many many (thousands) of logical

shards

Monday April 23 12

that map to fewer physical ones

Monday April 23 12

8 logical shards on 2 machines

user_id 8 = logical shard

logical shards -gt physical shard map

0 A 1 A 2 A 3 A 4 B 5 B 6 B 7 B

Monday April 23 12

8 logical shards on 2 4 machines

user_id 8 = logical shard

logical shards -gt physical shard map

0 A 1 A 2 C 3 C 4 B 5 B 6 D 7 D

Monday April 23 12

iexclschemas

Monday April 23 12

all that lsquopublicrsquo stuff Irsquod been glossing over for 2

years

Monday April 23 12

- database - schema - table - columns

Monday April 23 12

spun up set of machines

Monday April 23 12

using fabric created thousands of schemas

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineB shard4 photos_by_user shard5 photos_by_user shard6 photos_by_user shard7 photos_by_user

Monday April 23 12

(fabric or similar parallel task executor is

essential)

Monday April 23 12

application-side logic

Monday April 23 12

SHARD_TO_DB =

SHARD_TO_DB[0] = 0SHARD_TO_DB[1] = 0SHARD_TO_DB[2] = 0SHARD_TO_DB[3] = 0SHARD_TO_DB[4] = 1SHARD_TO_DB[5] = 1SHARD_TO_DB[6] = 1SHARD_TO_DB[7] = 1

Monday April 23 12

instead of Django ORM wrote really simple db

abstraction layer

Monday April 23 12

selectupdateinsertdelete

Monday April 23 12

select(fields table_name shard_key where_statement where_parameters)

Monday April 23 12

select(fields table_name shard_key where_statement where_parameters)

shard_key num_logical_shards = shard_id

Monday April 23 12

in most cases user_id for us

Monday April 23 12

custom Django test runner to createtear-down sharded DBs

Monday April 23 12

most queries involve visiting handful of shards

over one or two machines

Monday April 23 12

if mapping across shards on single DB UNION

ALL to aggregate

Monday April 23 12

clients to library pass in((shard_key id)

(shard_keyid)) etc

Monday April 23 12

library maps sub-selects to each shard and each

machine

Monday April 23 12

parallel execution (per-machine at least)

Monday April 23 12

-gt Append (cost=00097372 rows=100 width=12) (actual time=0290160035 rows=30 loops=1) -gt Limit (cost=00080624 rows=30 width=12) (actual time=0288159913 rows=14 loops=1) -gt Index Scan Backward using index on table (cost=0001865104 rows=694 width=12) (actual time=0286159885 rows=14 loops=1) -gt Limit (cost=0007115 rows=30 width=12) (actual time=00150018 rows=1 loops=1) -gt Index Scan using index on table (cost=00010199 rows=43 width=12) (actual time=00130014 rows=1 loops=1) (etc)

Monday April 23 12

eventually would be nice to parallelize across

machines

Monday April 23 12

next challenge unique IDs

Monday April 23 12

requirements

Monday April 23 12

1 should be time sortable without requiring

a lookup

Monday April 23 12

2 should be 64-bit

Monday April 23 12

3 low operational complexity

Monday April 23 12

surveyed the options

Monday April 23 12

ticket servers

Monday April 23 12

UUID

Monday April 23 12

twitter snowflake

Monday April 23 12

application-level IDs ala Mongo

Monday April 23 12

hey the db is already pretty good about

incrementing sequences

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

CREATE OR REPLACE FUNCTION insta5next_id(OUT result bigint) AS $$DECLARE our_epoch bigint = 1314220021721 seq_id bigint now_millis bigint shard_id int = 5BEGIN SELECT nextval(insta5table_id_seq) 1024 INTO seq_id

SELECT FLOOR(EXTRACT(EPOCH FROM clock_timestamp()) 1000) INTO now_millis result = (now_millis - our_epoch) ltlt 23 result = result | (shard_id ltlt 10) result = result | (seq_id)END$$ LANGUAGE PLPGSQL

Monday April 23 12

pulling shard ID from ID

shard_id = id ^ ((id gtgt 23) ltlt 23)timestamp = EPOCH + id gtgt 23

Monday April 23 12

pros guaranteed unique in 64-bits not much of a

CPU overhead

Monday April 23 12

cons large IDs from the get-go

Monday April 23 12

hundreds of millions of IDs generated with this

scheme no issues

Monday April 23 12

well what about ldquore-shardingrdquo

Monday April 23 12

first recourse pg_reorg

Monday April 23 12

rewrites tables in index order

Monday April 23 12

only requires brief locks for atomic table renames

Monday April 23 12

20+GB savings on some of our dbs

Monday April 23 12

Monday April 23 12

especially useful on EC2

Monday April 23 12

but sometimes you just have to reshard

Monday April 23 12

streaming replication to the rescue

Monday April 23 12

(btw repmgr is awesome)

Monday April 23 12

repmgr standby clone ltmastergt

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineArsquo shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineC shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

Monday April 23 12

PGBouncer abstracts moving DBs from the

app logic

Monday April 23 12

can do this as long as you have more logical shards than physical

ones

Monday April 23 12

beauty of schemas is that they are physically

different files

Monday April 23 12

(no IO hit when deleting no lsquoswiss cheesersquo)

Monday April 23 12

downside requires ~30 seconds of maintenance to roll out new schema

mapping

Monday April 23 12

(could be solved by having concept of ldquoread-

onlyrdquo mode for some DBs)

Monday April 23 12

not great for range-scans that would span across

shards

Monday April 23 12

latest project follow graph

Monday April 23 12

v1 simple DB table(source_id target_id

status)

Monday April 23 12

who do I followwho follows me

do I follow Xdoes X follow me

Monday April 23 12

DB was busy so we started storing parallel

version in Redis

Monday April 23 12

follow_all(300 item list)

Monday April 23 12

inconsistency

Monday April 23 12

extra logic

Monday April 23 12

so much extra logic

Monday April 23 12

exposing your support team to the idea of cache invalidation

Monday April 23 12

Monday April 23 12

redesign took a page from twitterrsquos book

Monday April 23 12

PG can handle tens of thousands of requests very light memcached

caching

Monday April 23 12

next steps

Monday April 23 12

isolating services to minimize open conns

Monday April 23 12

investigate physical hardware etc to reduce

need to re-shard

Monday April 23 12

Wrap up

Monday April 23 12

you donrsquot need to give up PGrsquos durability amp features to shard

Monday April 23 12

continue to let the DB do what the DB is great at

Monday April 23 12

ldquodonrsquot shard until you have tordquo

Monday April 23 12

(but donrsquot over-estimate how hard it will be either)

Monday April 23 12

scaled within constraints of the cloud

Monday April 23 12

PG success story

Monday April 23 12

(wersquore really excited about 92)

Monday April 23 12

thanks any qs

Monday April 23 12

that map to fewer physical ones

Monday April 23 12

8 logical shards on 2 machines

user_id 8 = logical shard

logical shards -gt physical shard map

0 A 1 A 2 A 3 A 4 B 5 B 6 B 7 B

Monday April 23 12

8 logical shards on 2 4 machines

user_id 8 = logical shard

logical shards -gt physical shard map

0 A 1 A 2 C 3 C 4 B 5 B 6 D 7 D

Monday April 23 12

iexclschemas

Monday April 23 12

all that lsquopublicrsquo stuff Irsquod been glossing over for 2

years

Monday April 23 12

- database - schema - table - columns

Monday April 23 12

spun up set of machines

Monday April 23 12

using fabric created thousands of schemas

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineB shard4 photos_by_user shard5 photos_by_user shard6 photos_by_user shard7 photos_by_user

Monday April 23 12

(fabric or similar parallel task executor is

essential)

Monday April 23 12

application-side logic

Monday April 23 12

SHARD_TO_DB =

SHARD_TO_DB[0] = 0SHARD_TO_DB[1] = 0SHARD_TO_DB[2] = 0SHARD_TO_DB[3] = 0SHARD_TO_DB[4] = 1SHARD_TO_DB[5] = 1SHARD_TO_DB[6] = 1SHARD_TO_DB[7] = 1

Monday April 23 12

instead of Django ORM wrote really simple db

abstraction layer

Monday April 23 12

selectupdateinsertdelete

Monday April 23 12

select(fields table_name shard_key where_statement where_parameters)

Monday April 23 12

select(fields table_name shard_key where_statement where_parameters)

shard_key num_logical_shards = shard_id

Monday April 23 12

in most cases user_id for us

Monday April 23 12

custom Django test runner to createtear-down sharded DBs

Monday April 23 12

most queries involve visiting handful of shards

over one or two machines

Monday April 23 12

if mapping across shards on single DB UNION

ALL to aggregate

Monday April 23 12

clients to library pass in((shard_key id)

(shard_keyid)) etc

Monday April 23 12

library maps sub-selects to each shard and each

machine

Monday April 23 12

parallel execution (per-machine at least)

Monday April 23 12

-gt Append (cost=00097372 rows=100 width=12) (actual time=0290160035 rows=30 loops=1) -gt Limit (cost=00080624 rows=30 width=12) (actual time=0288159913 rows=14 loops=1) -gt Index Scan Backward using index on table (cost=0001865104 rows=694 width=12) (actual time=0286159885 rows=14 loops=1) -gt Limit (cost=0007115 rows=30 width=12) (actual time=00150018 rows=1 loops=1) -gt Index Scan using index on table (cost=00010199 rows=43 width=12) (actual time=00130014 rows=1 loops=1) (etc)

Monday April 23 12

eventually would be nice to parallelize across

machines

Monday April 23 12

next challenge unique IDs

Monday April 23 12

requirements

Monday April 23 12

1 should be time sortable without requiring

a lookup

Monday April 23 12

2 should be 64-bit

Monday April 23 12

3 low operational complexity

Monday April 23 12

surveyed the options

Monday April 23 12

ticket servers

Monday April 23 12

UUID

Monday April 23 12

twitter snowflake

Monday April 23 12

application-level IDs ala Mongo

Monday April 23 12

hey the db is already pretty good about

incrementing sequences

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

CREATE OR REPLACE FUNCTION insta5next_id(OUT result bigint) AS $$DECLARE our_epoch bigint = 1314220021721 seq_id bigint now_millis bigint shard_id int = 5BEGIN SELECT nextval(insta5table_id_seq) 1024 INTO seq_id

SELECT FLOOR(EXTRACT(EPOCH FROM clock_timestamp()) 1000) INTO now_millis result = (now_millis - our_epoch) ltlt 23 result = result | (shard_id ltlt 10) result = result | (seq_id)END$$ LANGUAGE PLPGSQL

Monday April 23 12

pulling shard ID from ID

shard_id = id ^ ((id gtgt 23) ltlt 23)timestamp = EPOCH + id gtgt 23

Monday April 23 12

pros guaranteed unique in 64-bits not much of a

CPU overhead

Monday April 23 12

cons large IDs from the get-go

Monday April 23 12

hundreds of millions of IDs generated with this

scheme no issues

Monday April 23 12

well what about ldquore-shardingrdquo

Monday April 23 12

first recourse pg_reorg

Monday April 23 12

rewrites tables in index order

Monday April 23 12

only requires brief locks for atomic table renames

Monday April 23 12

20+GB savings on some of our dbs

Monday April 23 12

Monday April 23 12

especially useful on EC2

Monday April 23 12

but sometimes you just have to reshard

Monday April 23 12

streaming replication to the rescue

Monday April 23 12

(btw repmgr is awesome)

Monday April 23 12

repmgr standby clone ltmastergt

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineArsquo shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineC shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

Monday April 23 12

PGBouncer abstracts moving DBs from the

app logic

Monday April 23 12

can do this as long as you have more logical shards than physical

ones

Monday April 23 12

beauty of schemas is that they are physically

different files

Monday April 23 12

(no IO hit when deleting no lsquoswiss cheesersquo)

Monday April 23 12

downside requires ~30 seconds of maintenance to roll out new schema

mapping

Monday April 23 12

(could be solved by having concept of ldquoread-

onlyrdquo mode for some DBs)

Monday April 23 12

not great for range-scans that would span across

shards

Monday April 23 12

latest project follow graph

Monday April 23 12

v1 simple DB table(source_id target_id

status)

Monday April 23 12

who do I followwho follows me

do I follow Xdoes X follow me

Monday April 23 12

DB was busy so we started storing parallel

version in Redis

Monday April 23 12

follow_all(300 item list)

Monday April 23 12

inconsistency

Monday April 23 12

extra logic

Monday April 23 12

so much extra logic

Monday April 23 12

exposing your support team to the idea of cache invalidation

Monday April 23 12

Monday April 23 12

redesign took a page from twitterrsquos book

Monday April 23 12

PG can handle tens of thousands of requests very light memcached

caching

Monday April 23 12

next steps

Monday April 23 12

isolating services to minimize open conns

Monday April 23 12

investigate physical hardware etc to reduce

need to re-shard

Monday April 23 12

Wrap up

Monday April 23 12

you donrsquot need to give up PGrsquos durability amp features to shard

Monday April 23 12

continue to let the DB do what the DB is great at

Monday April 23 12

ldquodonrsquot shard until you have tordquo

Monday April 23 12

(but donrsquot over-estimate how hard it will be either)

Monday April 23 12

scaled within constraints of the cloud

Monday April 23 12

PG success story

Monday April 23 12

(wersquore really excited about 92)

Monday April 23 12

thanks any qs

Monday April 23 12

8 logical shards on 2 machines

user_id 8 = logical shard

logical shards -gt physical shard map

0 A 1 A 2 A 3 A 4 B 5 B 6 B 7 B

Monday April 23 12

8 logical shards on 2 4 machines

user_id 8 = logical shard

logical shards -gt physical shard map

0 A 1 A 2 C 3 C 4 B 5 B 6 D 7 D

Monday April 23 12

iexclschemas

Monday April 23 12

all that lsquopublicrsquo stuff Irsquod been glossing over for 2

years

Monday April 23 12

- database - schema - table - columns

Monday April 23 12

spun up set of machines

Monday April 23 12

using fabric created thousands of schemas

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineB shard4 photos_by_user shard5 photos_by_user shard6 photos_by_user shard7 photos_by_user

Monday April 23 12

(fabric or similar parallel task executor is

essential)

Monday April 23 12

application-side logic

Monday April 23 12

SHARD_TO_DB =

SHARD_TO_DB[0] = 0SHARD_TO_DB[1] = 0SHARD_TO_DB[2] = 0SHARD_TO_DB[3] = 0SHARD_TO_DB[4] = 1SHARD_TO_DB[5] = 1SHARD_TO_DB[6] = 1SHARD_TO_DB[7] = 1

Monday April 23 12

instead of Django ORM wrote really simple db

abstraction layer

Monday April 23 12

selectupdateinsertdelete

Monday April 23 12

select(fields table_name shard_key where_statement where_parameters)

Monday April 23 12

select(fields table_name shard_key where_statement where_parameters)

shard_key num_logical_shards = shard_id

Monday April 23 12

in most cases user_id for us

Monday April 23 12

custom Django test runner to createtear-down sharded DBs

Monday April 23 12

most queries involve visiting handful of shards

over one or two machines

Monday April 23 12

if mapping across shards on single DB UNION

ALL to aggregate

Monday April 23 12

clients to library pass in((shard_key id)

(shard_keyid)) etc

Monday April 23 12

library maps sub-selects to each shard and each

machine

Monday April 23 12

parallel execution (per-machine at least)

Monday April 23 12

-gt Append (cost=00097372 rows=100 width=12) (actual time=0290160035 rows=30 loops=1) -gt Limit (cost=00080624 rows=30 width=12) (actual time=0288159913 rows=14 loops=1) -gt Index Scan Backward using index on table (cost=0001865104 rows=694 width=12) (actual time=0286159885 rows=14 loops=1) -gt Limit (cost=0007115 rows=30 width=12) (actual time=00150018 rows=1 loops=1) -gt Index Scan using index on table (cost=00010199 rows=43 width=12) (actual time=00130014 rows=1 loops=1) (etc)

Monday April 23 12

eventually would be nice to parallelize across

machines

Monday April 23 12

next challenge unique IDs

Monday April 23 12

requirements

Monday April 23 12

1 should be time sortable without requiring

a lookup

Monday April 23 12

2 should be 64-bit

Monday April 23 12

3 low operational complexity

Monday April 23 12

surveyed the options

Monday April 23 12

ticket servers

Monday April 23 12

UUID

Monday April 23 12

twitter snowflake

Monday April 23 12

application-level IDs ala Mongo

Monday April 23 12

hey the db is already pretty good about

incrementing sequences

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

CREATE OR REPLACE FUNCTION insta5next_id(OUT result bigint) AS $$DECLARE our_epoch bigint = 1314220021721 seq_id bigint now_millis bigint shard_id int = 5BEGIN SELECT nextval(insta5table_id_seq) 1024 INTO seq_id

SELECT FLOOR(EXTRACT(EPOCH FROM clock_timestamp()) 1000) INTO now_millis result = (now_millis - our_epoch) ltlt 23 result = result | (shard_id ltlt 10) result = result | (seq_id)END$$ LANGUAGE PLPGSQL

Monday April 23 12

pulling shard ID from ID

shard_id = id ^ ((id gtgt 23) ltlt 23)timestamp = EPOCH + id gtgt 23

Monday April 23 12

pros guaranteed unique in 64-bits not much of a

CPU overhead

Monday April 23 12

cons large IDs from the get-go

Monday April 23 12

hundreds of millions of IDs generated with this

scheme no issues

Monday April 23 12

well what about ldquore-shardingrdquo

Monday April 23 12

first recourse pg_reorg

Monday April 23 12

rewrites tables in index order

Monday April 23 12

only requires brief locks for atomic table renames

Monday April 23 12

20+GB savings on some of our dbs

Monday April 23 12

Monday April 23 12

especially useful on EC2

Monday April 23 12

but sometimes you just have to reshard

Monday April 23 12

streaming replication to the rescue

Monday April 23 12

(btw repmgr is awesome)

Monday April 23 12

repmgr standby clone ltmastergt

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineArsquo shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineC shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

Monday April 23 12

PGBouncer abstracts moving DBs from the

app logic

Monday April 23 12

can do this as long as you have more logical shards than physical

ones

Monday April 23 12

beauty of schemas is that they are physically

different files

Monday April 23 12

(no IO hit when deleting no lsquoswiss cheesersquo)

Monday April 23 12

downside requires ~30 seconds of maintenance to roll out new schema

mapping

Monday April 23 12

(could be solved by having concept of ldquoread-

onlyrdquo mode for some DBs)

Monday April 23 12

not great for range-scans that would span across

shards

Monday April 23 12

latest project follow graph

Monday April 23 12

v1 simple DB table(source_id target_id

status)

Monday April 23 12

who do I followwho follows me

do I follow Xdoes X follow me

Monday April 23 12

DB was busy so we started storing parallel

version in Redis

Monday April 23 12

follow_all(300 item list)

Monday April 23 12

inconsistency

Monday April 23 12

extra logic

Monday April 23 12

so much extra logic

Monday April 23 12

exposing your support team to the idea of cache invalidation

Monday April 23 12

Monday April 23 12

redesign took a page from twitterrsquos book

Monday April 23 12

PG can handle tens of thousands of requests very light memcached

caching

Monday April 23 12

next steps

Monday April 23 12

isolating services to minimize open conns

Monday April 23 12

investigate physical hardware etc to reduce

need to re-shard

Monday April 23 12

Wrap up

Monday April 23 12

you donrsquot need to give up PGrsquos durability amp features to shard

Monday April 23 12

continue to let the DB do what the DB is great at

Monday April 23 12

ldquodonrsquot shard until you have tordquo

Monday April 23 12

(but donrsquot over-estimate how hard it will be either)

Monday April 23 12

scaled within constraints of the cloud

Monday April 23 12

PG success story

Monday April 23 12

(wersquore really excited about 92)

Monday April 23 12

thanks any qs

Monday April 23 12

8 logical shards on 2 4 machines

user_id 8 = logical shard

logical shards -gt physical shard map

0 A 1 A 2 C 3 C 4 B 5 B 6 D 7 D

Monday April 23 12

iexclschemas

Monday April 23 12

all that lsquopublicrsquo stuff Irsquod been glossing over for 2

years

Monday April 23 12

- database - schema - table - columns

Monday April 23 12

spun up set of machines

Monday April 23 12

using fabric created thousands of schemas

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineB shard4 photos_by_user shard5 photos_by_user shard6 photos_by_user shard7 photos_by_user

Monday April 23 12

(fabric or similar parallel task executor is

essential)

Monday April 23 12

application-side logic

Monday April 23 12

SHARD_TO_DB =

SHARD_TO_DB[0] = 0SHARD_TO_DB[1] = 0SHARD_TO_DB[2] = 0SHARD_TO_DB[3] = 0SHARD_TO_DB[4] = 1SHARD_TO_DB[5] = 1SHARD_TO_DB[6] = 1SHARD_TO_DB[7] = 1

Monday April 23 12

instead of Django ORM wrote really simple db

abstraction layer

Monday April 23 12

selectupdateinsertdelete

Monday April 23 12

select(fields table_name shard_key where_statement where_parameters)

Monday April 23 12

select(fields table_name shard_key where_statement where_parameters)

shard_key num_logical_shards = shard_id

Monday April 23 12

in most cases user_id for us

Monday April 23 12

custom Django test runner to createtear-down sharded DBs

Monday April 23 12

most queries involve visiting handful of shards

over one or two machines

Monday April 23 12

if mapping across shards on single DB UNION

ALL to aggregate

Monday April 23 12

clients to library pass in((shard_key id)

(shard_keyid)) etc

Monday April 23 12

library maps sub-selects to each shard and each

machine

Monday April 23 12

parallel execution (per-machine at least)

Monday April 23 12

-gt Append (cost=00097372 rows=100 width=12) (actual time=0290160035 rows=30 loops=1) -gt Limit (cost=00080624 rows=30 width=12) (actual time=0288159913 rows=14 loops=1) -gt Index Scan Backward using index on table (cost=0001865104 rows=694 width=12) (actual time=0286159885 rows=14 loops=1) -gt Limit (cost=0007115 rows=30 width=12) (actual time=00150018 rows=1 loops=1) -gt Index Scan using index on table (cost=00010199 rows=43 width=12) (actual time=00130014 rows=1 loops=1) (etc)

Monday April 23 12

eventually would be nice to parallelize across

machines

Monday April 23 12

next challenge unique IDs

Monday April 23 12

requirements

Monday April 23 12

1 should be time sortable without requiring

a lookup

Monday April 23 12

2 should be 64-bit

Monday April 23 12

3 low operational complexity

Monday April 23 12

surveyed the options

Monday April 23 12

ticket servers

Monday April 23 12

UUID

Monday April 23 12

twitter snowflake

Monday April 23 12

application-level IDs ala Mongo

Monday April 23 12

hey the db is already pretty good about

incrementing sequences

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

CREATE OR REPLACE FUNCTION insta5next_id(OUT result bigint) AS $$DECLARE our_epoch bigint = 1314220021721 seq_id bigint now_millis bigint shard_id int = 5BEGIN SELECT nextval(insta5table_id_seq) 1024 INTO seq_id

SELECT FLOOR(EXTRACT(EPOCH FROM clock_timestamp()) 1000) INTO now_millis result = (now_millis - our_epoch) ltlt 23 result = result | (shard_id ltlt 10) result = result | (seq_id)END$$ LANGUAGE PLPGSQL

Monday April 23 12

pulling shard ID from ID

shard_id = id ^ ((id gtgt 23) ltlt 23)timestamp = EPOCH + id gtgt 23

Monday April 23 12

pros guaranteed unique in 64-bits not much of a

CPU overhead

Monday April 23 12

cons large IDs from the get-go

Monday April 23 12

hundreds of millions of IDs generated with this

scheme no issues

Monday April 23 12

well what about ldquore-shardingrdquo

Monday April 23 12

first recourse pg_reorg

Monday April 23 12

rewrites tables in index order

Monday April 23 12

only requires brief locks for atomic table renames

Monday April 23 12

20+GB savings on some of our dbs

Monday April 23 12

Monday April 23 12

especially useful on EC2

Monday April 23 12

but sometimes you just have to reshard

Monday April 23 12

streaming replication to the rescue

Monday April 23 12

(btw repmgr is awesome)

Monday April 23 12

repmgr standby clone ltmastergt

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineArsquo shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineC shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

Monday April 23 12

PGBouncer abstracts moving DBs from the

app logic

Monday April 23 12

can do this as long as you have more logical shards than physical

ones

Monday April 23 12

beauty of schemas is that they are physically

different files

Monday April 23 12

(no IO hit when deleting no lsquoswiss cheesersquo)

Monday April 23 12

downside requires ~30 seconds of maintenance to roll out new schema

mapping

Monday April 23 12

(could be solved by having concept of ldquoread-

onlyrdquo mode for some DBs)

Monday April 23 12

not great for range-scans that would span across

shards

Monday April 23 12

latest project follow graph

Monday April 23 12

v1 simple DB table(source_id target_id

status)

Monday April 23 12

who do I followwho follows me

do I follow Xdoes X follow me

Monday April 23 12

DB was busy so we started storing parallel

version in Redis

Monday April 23 12

follow_all(300 item list)

Monday April 23 12

inconsistency

Monday April 23 12

extra logic

Monday April 23 12

so much extra logic

Monday April 23 12

exposing your support team to the idea of cache invalidation

Monday April 23 12

Monday April 23 12

redesign took a page from twitterrsquos book

Monday April 23 12

PG can handle tens of thousands of requests very light memcached

caching

Monday April 23 12

next steps

Monday April 23 12

isolating services to minimize open conns

Monday April 23 12

investigate physical hardware etc to reduce

need to re-shard

Monday April 23 12

Wrap up

Monday April 23 12

you donrsquot need to give up PGrsquos durability amp features to shard

Monday April 23 12

continue to let the DB do what the DB is great at

Monday April 23 12

ldquodonrsquot shard until you have tordquo

Monday April 23 12

(but donrsquot over-estimate how hard it will be either)

Monday April 23 12

scaled within constraints of the cloud

Monday April 23 12

PG success story

Monday April 23 12

(wersquore really excited about 92)

Monday April 23 12

thanks! any questions?

Monday April 23 12

iexclschemas

Monday April 23 12

all that lsquopublicrsquo stuff Irsquod been glossing over for 2

years

Monday April 23 12

- database - schema - table - columns

Monday April 23 12

spun up set of machines

Monday April 23 12

using fabric created thousands of schemas

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineB shard4 photos_by_user shard5 photos_by_user shard6 photos_by_user shard7 photos_by_user

Monday April 23 12

(fabric or similar parallel task executor is

essential)

Monday April 23 12

application-side logic

Monday April 23 12

SHARD_TO_DB =

SHARD_TO_DB[0] = 0SHARD_TO_DB[1] = 0SHARD_TO_DB[2] = 0SHARD_TO_DB[3] = 0SHARD_TO_DB[4] = 1SHARD_TO_DB[5] = 1SHARD_TO_DB[6] = 1SHARD_TO_DB[7] = 1

Monday April 23 12

instead of Django ORM wrote really simple db

abstraction layer

Monday April 23 12

selectupdateinsertdelete

Monday April 23 12

select(fields table_name shard_key where_statement where_parameters)

Monday April 23 12

select(fields table_name shard_key where_statement where_parameters)

shard_key num_logical_shards = shard_id

Monday April 23 12

in most cases user_id for us

Monday April 23 12

custom Django test runner to createtear-down sharded DBs

Monday April 23 12

most queries involve visiting handful of shards

over one or two machines

Monday April 23 12

if mapping across shards on single DB UNION

ALL to aggregate

Monday April 23 12

clients to library pass in((shard_key id)

(shard_keyid)) etc

Monday April 23 12

library maps sub-selects to each shard and each

machine

Monday April 23 12

parallel execution (per-machine at least)

Monday April 23 12

-gt Append (cost=00097372 rows=100 width=12) (actual time=0290160035 rows=30 loops=1) -gt Limit (cost=00080624 rows=30 width=12) (actual time=0288159913 rows=14 loops=1) -gt Index Scan Backward using index on table (cost=0001865104 rows=694 width=12) (actual time=0286159885 rows=14 loops=1) -gt Limit (cost=0007115 rows=30 width=12) (actual time=00150018 rows=1 loops=1) -gt Index Scan using index on table (cost=00010199 rows=43 width=12) (actual time=00130014 rows=1 loops=1) (etc)

Monday April 23 12

eventually would be nice to parallelize across

machines

Monday April 23 12

next challenge unique IDs

Monday April 23 12

requirements

Monday April 23 12

1 should be time sortable without requiring

a lookup

Monday April 23 12

2 should be 64-bit

Monday April 23 12

3 low operational complexity

Monday April 23 12

surveyed the options

Monday April 23 12

ticket servers

Monday April 23 12

UUID

Monday April 23 12

twitter snowflake

Monday April 23 12

application-level IDs ala Mongo

Monday April 23 12

hey the db is already pretty good about

incrementing sequences

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

CREATE OR REPLACE FUNCTION insta5next_id(OUT result bigint) AS $$DECLARE our_epoch bigint = 1314220021721 seq_id bigint now_millis bigint shard_id int = 5BEGIN SELECT nextval(insta5table_id_seq) 1024 INTO seq_id

SELECT FLOOR(EXTRACT(EPOCH FROM clock_timestamp()) 1000) INTO now_millis result = (now_millis - our_epoch) ltlt 23 result = result | (shard_id ltlt 10) result = result | (seq_id)END$$ LANGUAGE PLPGSQL

Monday April 23 12

pulling shard ID from ID

shard_id = id ^ ((id gtgt 23) ltlt 23)timestamp = EPOCH + id gtgt 23

Monday April 23 12

pros guaranteed unique in 64-bits not much of a

CPU overhead

Monday April 23 12

cons large IDs from the get-go

Monday April 23 12

hundreds of millions of IDs generated with this

scheme no issues

Monday April 23 12

well what about ldquore-shardingrdquo

Monday April 23 12

first recourse pg_reorg

Monday April 23 12

rewrites tables in index order

Monday April 23 12

only requires brief locks for atomic table renames

Monday April 23 12

20+GB savings on some of our dbs

Monday April 23 12

Monday April 23 12

especially useful on EC2

Monday April 23 12

but sometimes you just have to reshard

Monday April 23 12

streaming replication to the rescue

Monday April 23 12

(btw repmgr is awesome)

Monday April 23 12

repmgr standby clone ltmastergt

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineArsquo shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineC shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

Monday April 23 12

PGBouncer abstracts moving DBs from the

app logic

Monday April 23 12

can do this as long as you have more logical shards than physical

ones

Monday April 23 12

beauty of schemas is that they are physically

different files

Monday April 23 12

(no IO hit when deleting no lsquoswiss cheesersquo)

Monday April 23 12

downside requires ~30 seconds of maintenance to roll out new schema

mapping

Monday April 23 12

(could be solved by having concept of ldquoread-

onlyrdquo mode for some DBs)

Monday April 23 12

not great for range-scans that would span across

shards

Monday April 23 12

latest project follow graph

Monday April 23 12

v1 simple DB table(source_id target_id

status)

Monday April 23 12

who do I followwho follows me

do I follow Xdoes X follow me

Monday April 23 12

DB was busy so we started storing parallel

version in Redis

Monday April 23 12

follow_all(300 item list)

Monday April 23 12

inconsistency

Monday April 23 12

extra logic

Monday April 23 12

so much extra logic

Monday April 23 12

exposing your support team to the idea of cache invalidation

Monday April 23 12

Monday April 23 12

redesign took a page from twitterrsquos book

Monday April 23 12

PG can handle tens of thousands of requests very light memcached

caching

Monday April 23 12

next steps

Monday April 23 12

isolating services to minimize open conns

Monday April 23 12

investigate physical hardware etc to reduce

need to re-shard

Monday April 23 12

Wrap up

Monday April 23 12

you donrsquot need to give up PGrsquos durability amp features to shard

Monday April 23 12

continue to let the DB do what the DB is great at

Monday April 23 12

ldquodonrsquot shard until you have tordquo

Monday April 23 12

(but donrsquot over-estimate how hard it will be either)

Monday April 23 12

scaled within constraints of the cloud

Monday April 23 12

PG success story

Monday April 23 12

(wersquore really excited about 92)

Monday April 23 12

thanks any qs

Monday April 23 12

all that lsquopublicrsquo stuff Irsquod been glossing over for 2

years

Monday April 23 12

- database - schema - table - columns

Monday April 23 12

spun up set of machines

Monday April 23 12

using fabric created thousands of schemas

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineB shard4 photos_by_user shard5 photos_by_user shard6 photos_by_user shard7 photos_by_user

Monday April 23 12

(fabric or similar parallel task executor is

essential)

Monday April 23 12

application-side logic

Monday April 23 12

SHARD_TO_DB =

SHARD_TO_DB[0] = 0SHARD_TO_DB[1] = 0SHARD_TO_DB[2] = 0SHARD_TO_DB[3] = 0SHARD_TO_DB[4] = 1SHARD_TO_DB[5] = 1SHARD_TO_DB[6] = 1SHARD_TO_DB[7] = 1

Monday April 23 12

instead of Django ORM wrote really simple db

abstraction layer

Monday April 23 12

selectupdateinsertdelete

Monday April 23 12

select(fields table_name shard_key where_statement where_parameters)

Monday April 23 12

select(fields table_name shard_key where_statement where_parameters)

shard_key num_logical_shards = shard_id

Monday April 23 12

in most cases user_id for us

Monday April 23 12

custom Django test runner to createtear-down sharded DBs

Monday April 23 12

most queries involve visiting handful of shards

over one or two machines

Monday April 23 12

if mapping across shards on single DB UNION

ALL to aggregate

Monday April 23 12

clients to library pass in((shard_key id)

(shard_keyid)) etc

Monday April 23 12

library maps sub-selects to each shard and each

machine

Monday April 23 12

parallel execution (per-machine at least)

Monday April 23 12

-gt Append (cost=00097372 rows=100 width=12) (actual time=0290160035 rows=30 loops=1) -gt Limit (cost=00080624 rows=30 width=12) (actual time=0288159913 rows=14 loops=1) -gt Index Scan Backward using index on table (cost=0001865104 rows=694 width=12) (actual time=0286159885 rows=14 loops=1) -gt Limit (cost=0007115 rows=30 width=12) (actual time=00150018 rows=1 loops=1) -gt Index Scan using index on table (cost=00010199 rows=43 width=12) (actual time=00130014 rows=1 loops=1) (etc)

Monday April 23 12

eventually would be nice to parallelize across

machines

Monday April 23 12

next challenge unique IDs

Monday April 23 12

requirements

Monday April 23 12

1 should be time sortable without requiring

a lookup

Monday April 23 12

2 should be 64-bit

Monday April 23 12

3 low operational complexity

Monday April 23 12

surveyed the options

Monday April 23 12

ticket servers

Monday April 23 12

UUID

Monday April 23 12

twitter snowflake

Monday April 23 12

application-level IDs ala Mongo

Monday April 23 12

hey the db is already pretty good about

incrementing sequences

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

CREATE OR REPLACE FUNCTION insta5next_id(OUT result bigint) AS $$DECLARE our_epoch bigint = 1314220021721 seq_id bigint now_millis bigint shard_id int = 5BEGIN SELECT nextval(insta5table_id_seq) 1024 INTO seq_id

SELECT FLOOR(EXTRACT(EPOCH FROM clock_timestamp()) 1000) INTO now_millis result = (now_millis - our_epoch) ltlt 23 result = result | (shard_id ltlt 10) result = result | (seq_id)END$$ LANGUAGE PLPGSQL

Monday April 23 12

pulling shard ID from ID

shard_id = id ^ ((id gtgt 23) ltlt 23)timestamp = EPOCH + id gtgt 23

Monday April 23 12

pros guaranteed unique in 64-bits not much of a

CPU overhead

Monday April 23 12

cons large IDs from the get-go

Monday April 23 12

hundreds of millions of IDs generated with this

scheme no issues

Monday April 23 12

well what about ldquore-shardingrdquo

Monday April 23 12

first recourse pg_reorg

Monday April 23 12

rewrites tables in index order

Monday April 23 12

only requires brief locks for atomic table renames

Monday April 23 12

20+GB savings on some of our dbs

Monday April 23 12

Monday April 23 12

especially useful on EC2

Monday April 23 12

but sometimes you just have to reshard

Monday April 23 12

streaming replication to the rescue

Monday April 23 12

(btw repmgr is awesome)

Monday April 23 12

repmgr standby clone ltmastergt

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineArsquo shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineC shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

Monday April 23 12

PGBouncer abstracts moving DBs from the

app logic

Monday April 23 12

can do this as long as you have more logical shards than physical

ones

Monday April 23 12

beauty of schemas is that they are physically

different files

Monday April 23 12

(no IO hit when deleting no lsquoswiss cheesersquo)

Monday April 23 12

downside requires ~30 seconds of maintenance to roll out new schema

mapping

Monday April 23 12

(could be solved by having concept of ldquoread-

onlyrdquo mode for some DBs)

Monday April 23 12

not great for range-scans that would span across

shards

Monday April 23 12

latest project follow graph

Monday April 23 12

v1 simple DB table(source_id target_id

status)

Monday April 23 12

who do I followwho follows me

do I follow Xdoes X follow me

Monday April 23 12

DB was busy so we started storing parallel

version in Redis

Monday April 23 12

follow_all(300 item list)

Monday April 23 12

inconsistency

Monday April 23 12

extra logic

Monday April 23 12

so much extra logic

Monday April 23 12

exposing your support team to the idea of cache invalidation

Monday April 23 12

Monday April 23 12

redesign took a page from twitterrsquos book

Monday April 23 12

PG can handle tens of thousands of requests very light memcached

caching

Monday April 23 12

next steps

Monday April 23 12

isolating services to minimize open conns

Monday April 23 12

investigate physical hardware etc to reduce

need to re-shard

Monday April 23 12

Wrap up

Monday April 23 12

you donrsquot need to give up PGrsquos durability amp features to shard

Monday April 23 12

continue to let the DB do what the DB is great at

Monday April 23 12

ldquodonrsquot shard until you have tordquo

Monday April 23 12

(but donrsquot over-estimate how hard it will be either)

Monday April 23 12

scaled within constraints of the cloud

Monday April 23 12

PG success story

Monday April 23 12

(wersquore really excited about 92)

Monday April 23 12

thanks any qs

Monday April 23 12

- database - schema - table - columns

Monday April 23 12

spun up set of machines

Monday April 23 12

using fabric created thousands of schemas

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineB shard4 photos_by_user shard5 photos_by_user shard6 photos_by_user shard7 photos_by_user

Monday April 23 12

(fabric or similar parallel task executor is

essential)

Monday April 23 12

application-side logic

Monday April 23 12

SHARD_TO_DB =

SHARD_TO_DB[0] = 0SHARD_TO_DB[1] = 0SHARD_TO_DB[2] = 0SHARD_TO_DB[3] = 0SHARD_TO_DB[4] = 1SHARD_TO_DB[5] = 1SHARD_TO_DB[6] = 1SHARD_TO_DB[7] = 1

Monday April 23 12

instead of Django ORM wrote really simple db

abstraction layer

Monday April 23 12

selectupdateinsertdelete

Monday April 23 12

select(fields table_name shard_key where_statement where_parameters)

Monday April 23 12

select(fields table_name shard_key where_statement where_parameters)

shard_key num_logical_shards = shard_id

Monday April 23 12

in most cases user_id for us

Monday April 23 12

custom Django test runner to createtear-down sharded DBs

Monday April 23 12

most queries involve visiting handful of shards

over one or two machines

Monday April 23 12

if mapping across shards on single DB UNION

ALL to aggregate

Monday April 23 12

clients to library pass in((shard_key id)

(shard_keyid)) etc

Monday April 23 12

library maps sub-selects to each shard and each

machine

Monday April 23 12

parallel execution (per-machine at least)

Monday April 23 12

-gt Append (cost=00097372 rows=100 width=12) (actual time=0290160035 rows=30 loops=1) -gt Limit (cost=00080624 rows=30 width=12) (actual time=0288159913 rows=14 loops=1) -gt Index Scan Backward using index on table (cost=0001865104 rows=694 width=12) (actual time=0286159885 rows=14 loops=1) -gt Limit (cost=0007115 rows=30 width=12) (actual time=00150018 rows=1 loops=1) -gt Index Scan using index on table (cost=00010199 rows=43 width=12) (actual time=00130014 rows=1 loops=1) (etc)

Monday April 23 12

eventually would be nice to parallelize across

machines

Monday April 23 12

next challenge unique IDs

Monday April 23 12

requirements

Monday April 23 12

1 should be time sortable without requiring

a lookup

Monday April 23 12

2 should be 64-bit

Monday April 23 12

3 low operational complexity

Monday April 23 12

surveyed the options

Monday April 23 12

ticket servers

Monday April 23 12

UUID

Monday April 23 12

twitter snowflake

Monday April 23 12

application-level IDs ala Mongo

Monday April 23 12

hey the db is already pretty good about

incrementing sequences

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

CREATE OR REPLACE FUNCTION insta5next_id(OUT result bigint) AS $$DECLARE our_epoch bigint = 1314220021721 seq_id bigint now_millis bigint shard_id int = 5BEGIN SELECT nextval(insta5table_id_seq) 1024 INTO seq_id

SELECT FLOOR(EXTRACT(EPOCH FROM clock_timestamp()) 1000) INTO now_millis result = (now_millis - our_epoch) ltlt 23 result = result | (shard_id ltlt 10) result = result | (seq_id)END$$ LANGUAGE PLPGSQL

Monday April 23 12

pulling shard ID from ID

shard_id = id ^ ((id gtgt 23) ltlt 23)timestamp = EPOCH + id gtgt 23

Monday April 23 12

pros guaranteed unique in 64-bits not much of a

CPU overhead

Monday April 23 12

cons large IDs from the get-go

Monday April 23 12

hundreds of millions of IDs generated with this

scheme no issues

Monday April 23 12

well what about ldquore-shardingrdquo

Monday April 23 12

first recourse pg_reorg

Monday April 23 12

rewrites tables in index order

Monday April 23 12

only requires brief locks for atomic table renames

Monday April 23 12

20+GB savings on some of our dbs

Monday April 23 12

Monday April 23 12

especially useful on EC2

Monday April 23 12

but sometimes you just have to reshard

Monday April 23 12

streaming replication to the rescue

Monday April 23 12

(btw repmgr is awesome)

Monday April 23 12

repmgr standby clone ltmastergt

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineArsquo shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineC shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

Monday April 23 12

PGBouncer abstracts moving DBs from the

app logic

Monday April 23 12

can do this as long as you have more logical shards than physical

ones

Monday April 23 12

beauty of schemas is that they are physically

different files

Monday April 23 12

(no IO hit when deleting no lsquoswiss cheesersquo)

Monday April 23 12

downside requires ~30 seconds of maintenance to roll out new schema

mapping

Monday April 23 12

(could be solved by having concept of ldquoread-

onlyrdquo mode for some DBs)

Monday April 23 12

not great for range-scans that would span across

shards

Monday April 23 12

latest project follow graph

Monday April 23 12

v1 simple DB table(source_id target_id

status)

Monday April 23 12

who do I followwho follows me

do I follow Xdoes X follow me

Monday April 23 12

DB was busy so we started storing parallel

version in Redis

Monday April 23 12

follow_all(300 item list)

Monday April 23 12

inconsistency

Monday April 23 12

extra logic

Monday April 23 12

so much extra logic

Monday April 23 12

exposing your support team to the idea of cache invalidation

Monday April 23 12

Monday April 23 12

redesign took a page from twitterrsquos book

Monday April 23 12

PG can handle tens of thousands of requests very light memcached

caching

Monday April 23 12

next steps

Monday April 23 12

isolating services to minimize open conns

Monday April 23 12

investigate physical hardware etc to reduce

need to re-shard

Monday April 23 12

Wrap up

Monday April 23 12

you donrsquot need to give up PGrsquos durability amp features to shard

Monday April 23 12

continue to let the DB do what the DB is great at

Monday April 23 12

ldquodonrsquot shard until you have tordquo

Monday April 23 12

(but donrsquot over-estimate how hard it will be either)

Monday April 23 12

scaled within constraints of the cloud

Monday April 23 12

PG success story

Monday April 23 12

(wersquore really excited about 92)

Monday April 23 12

thanks any qs

Monday April 23 12

spun up set of machines

Monday April 23 12

using fabric created thousands of schemas

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineB shard4 photos_by_user shard5 photos_by_user shard6 photos_by_user shard7 photos_by_user

Monday April 23 12

(fabric or similar parallel task executor is

essential)

Monday April 23 12

application-side logic

Monday April 23 12

SHARD_TO_DB =

SHARD_TO_DB[0] = 0SHARD_TO_DB[1] = 0SHARD_TO_DB[2] = 0SHARD_TO_DB[3] = 0SHARD_TO_DB[4] = 1SHARD_TO_DB[5] = 1SHARD_TO_DB[6] = 1SHARD_TO_DB[7] = 1

Monday April 23 12

instead of Django ORM wrote really simple db

abstraction layer

Monday April 23 12

selectupdateinsertdelete

Monday April 23 12

select(fields table_name shard_key where_statement where_parameters)

Monday April 23 12

select(fields table_name shard_key where_statement where_parameters)

shard_key num_logical_shards = shard_id

Monday April 23 12

in most cases user_id for us

Monday April 23 12

custom Django test runner to createtear-down sharded DBs

Monday April 23 12

most queries involve visiting handful of shards

over one or two machines

Monday April 23 12

if mapping across shards on single DB UNION

ALL to aggregate

Monday April 23 12

clients to library pass in((shard_key id)

(shard_keyid)) etc

Monday April 23 12

library maps sub-selects to each shard and each

machine

Monday April 23 12

parallel execution (per-machine at least)

Monday April 23 12

-gt Append (cost=00097372 rows=100 width=12) (actual time=0290160035 rows=30 loops=1) -gt Limit (cost=00080624 rows=30 width=12) (actual time=0288159913 rows=14 loops=1) -gt Index Scan Backward using index on table (cost=0001865104 rows=694 width=12) (actual time=0286159885 rows=14 loops=1) -gt Limit (cost=0007115 rows=30 width=12) (actual time=00150018 rows=1 loops=1) -gt Index Scan using index on table (cost=00010199 rows=43 width=12) (actual time=00130014 rows=1 loops=1) (etc)

Monday April 23 12

eventually would be nice to parallelize across

machines

Monday April 23 12

next challenge unique IDs

Monday April 23 12

requirements

Monday April 23 12

1 should be time sortable without requiring

a lookup

Monday April 23 12

2 should be 64-bit

Monday April 23 12

3 low operational complexity

Monday April 23 12

surveyed the options

Monday April 23 12

ticket servers

Monday April 23 12

UUID

Monday April 23 12

twitter snowflake

Monday April 23 12

application-level IDs ala Mongo

Monday April 23 12

hey the db is already pretty good about

incrementing sequences

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

CREATE OR REPLACE FUNCTION insta5next_id(OUT result bigint) AS $$DECLARE our_epoch bigint = 1314220021721 seq_id bigint now_millis bigint shard_id int = 5BEGIN SELECT nextval(insta5table_id_seq) 1024 INTO seq_id

SELECT FLOOR(EXTRACT(EPOCH FROM clock_timestamp()) 1000) INTO now_millis result = (now_millis - our_epoch) ltlt 23 result = result | (shard_id ltlt 10) result = result | (seq_id)END$$ LANGUAGE PLPGSQL

Monday April 23 12

pulling shard ID from ID

shard_id = id ^ ((id gtgt 23) ltlt 23)timestamp = EPOCH + id gtgt 23

Monday April 23 12

pros guaranteed unique in 64-bits not much of a

CPU overhead

Monday April 23 12

cons large IDs from the get-go

Monday April 23 12

hundreds of millions of IDs generated with this

scheme no issues

Monday April 23 12

well what about ldquore-shardingrdquo

Monday April 23 12

first recourse pg_reorg

Monday April 23 12

rewrites tables in index order

Monday April 23 12

only requires brief locks for atomic table renames

Monday April 23 12

20+GB savings on some of our dbs

Monday April 23 12

Monday April 23 12

especially useful on EC2

Monday April 23 12

but sometimes you just have to reshard

Monday April 23 12

streaming replication to the rescue

Monday April 23 12

(btw repmgr is awesome)

Monday April 23 12

repmgr standby clone ltmastergt

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineArsquo shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineC shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

Monday April 23 12

PGBouncer abstracts moving DBs from the

app logic

Monday April 23 12

can do this as long as you have more logical shards than physical

ones

Monday April 23 12

beauty of schemas is that they are physically

different files

Monday April 23 12

(no IO hit when deleting no lsquoswiss cheesersquo)

Monday April 23 12

downside requires ~30 seconds of maintenance to roll out new schema

mapping

Monday April 23 12

(could be solved by having concept of ldquoread-

onlyrdquo mode for some DBs)

Monday April 23 12

not great for range-scans that would span across

shards

Monday April 23 12

latest project follow graph

Monday April 23 12

v1 simple DB table(source_id target_id

status)

Monday April 23 12

who do I followwho follows me

do I follow Xdoes X follow me

Monday April 23 12

DB was busy so we started storing parallel

version in Redis

Monday April 23 12

follow_all(300 item list)

Monday April 23 12

inconsistency

Monday April 23 12

extra logic

Monday April 23 12

so much extra logic

Monday April 23 12

exposing your support team to the idea of cache invalidation

Monday April 23 12

Monday April 23 12

redesign took a page from twitterrsquos book

Monday April 23 12

PG can handle tens of thousands of requests very light memcached

caching

Monday April 23 12

next steps

Monday April 23 12

isolating services to minimize open conns

Monday April 23 12

investigate physical hardware etc to reduce

need to re-shard

Monday April 23 12

Wrap up

Monday April 23 12

you donrsquot need to give up PGrsquos durability amp features to shard

Monday April 23 12

continue to let the DB do what the DB is great at

Monday April 23 12

ldquodonrsquot shard until you have tordquo

Monday April 23 12

(but donrsquot over-estimate how hard it will be either)

Monday April 23 12

scaled within constraints of the cloud

Monday April 23 12

PG success story

Monday April 23 12

(wersquore really excited about 92)

Monday April 23 12

thanks any qs

Monday April 23 12

using fabric created thousands of schemas

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineB shard4 photos_by_user shard5 photos_by_user shard6 photos_by_user shard7 photos_by_user

Monday April 23 12

(fabric or similar parallel task executor is

essential)

Monday April 23 12

application-side logic

Monday April 23 12

SHARD_TO_DB =

SHARD_TO_DB[0] = 0SHARD_TO_DB[1] = 0SHARD_TO_DB[2] = 0SHARD_TO_DB[3] = 0SHARD_TO_DB[4] = 1SHARD_TO_DB[5] = 1SHARD_TO_DB[6] = 1SHARD_TO_DB[7] = 1

Monday April 23 12

instead of Django ORM wrote really simple db

abstraction layer

Monday April 23 12

selectupdateinsertdelete

Monday April 23 12

select(fields table_name shard_key where_statement where_parameters)

Monday April 23 12

select(fields table_name shard_key where_statement where_parameters)

shard_key num_logical_shards = shard_id

Monday April 23 12

in most cases user_id for us

Monday April 23 12

custom Django test runner to createtear-down sharded DBs

Monday April 23 12

most queries involve visiting handful of shards

over one or two machines

Monday April 23 12

if mapping across shards on single DB UNION

ALL to aggregate

Monday April 23 12

clients to library pass in((shard_key id)

(shard_keyid)) etc

Monday April 23 12

library maps sub-selects to each shard and each

machine

Monday April 23 12

parallel execution (per-machine at least)

Monday April 23 12

-gt Append (cost=00097372 rows=100 width=12) (actual time=0290160035 rows=30 loops=1) -gt Limit (cost=00080624 rows=30 width=12) (actual time=0288159913 rows=14 loops=1) -gt Index Scan Backward using index on table (cost=0001865104 rows=694 width=12) (actual time=0286159885 rows=14 loops=1) -gt Limit (cost=0007115 rows=30 width=12) (actual time=00150018 rows=1 loops=1) -gt Index Scan using index on table (cost=00010199 rows=43 width=12) (actual time=00130014 rows=1 loops=1) (etc)

Monday April 23 12

eventually would be nice to parallelize across

machines

Monday April 23 12

next challenge unique IDs

Monday April 23 12

requirements

Monday April 23 12

1 should be time sortable without requiring

a lookup

Monday April 23 12

2 should be 64-bit

Monday April 23 12

3 low operational complexity

Monday April 23 12

surveyed the options

Monday April 23 12

ticket servers

Monday April 23 12

UUID

Monday April 23 12

twitter snowflake

Monday April 23 12

application-level IDs ala Mongo

Monday April 23 12

hey the db is already pretty good about

incrementing sequences

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

CREATE OR REPLACE FUNCTION insta5next_id(OUT result bigint) AS $$DECLARE our_epoch bigint = 1314220021721 seq_id bigint now_millis bigint shard_id int = 5BEGIN SELECT nextval(insta5table_id_seq) 1024 INTO seq_id

SELECT FLOOR(EXTRACT(EPOCH FROM clock_timestamp()) 1000) INTO now_millis result = (now_millis - our_epoch) ltlt 23 result = result | (shard_id ltlt 10) result = result | (seq_id)END$$ LANGUAGE PLPGSQL

Monday April 23 12

pulling shard ID from ID

shard_id = id ^ ((id gtgt 23) ltlt 23)timestamp = EPOCH + id gtgt 23

Monday April 23 12

pros guaranteed unique in 64-bits not much of a

CPU overhead

Monday April 23 12

cons large IDs from the get-go

Monday April 23 12

hundreds of millions of IDs generated with this

scheme no issues

Monday April 23 12

well what about ldquore-shardingrdquo

Monday April 23 12

first recourse pg_reorg

Monday April 23 12

rewrites tables in index order

Monday April 23 12

only requires brief locks for atomic table renames

Monday April 23 12

20+GB savings on some of our dbs

Monday April 23 12

Monday April 23 12

especially useful on EC2

Monday April 23 12

but sometimes you just have to reshard

Monday April 23 12

streaming replication to the rescue

Monday April 23 12

(btw repmgr is awesome)

Monday April 23 12

repmgr standby clone ltmastergt

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineArsquo shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineC shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

Monday April 23 12

PGBouncer abstracts moving DBs from the

app logic

Monday April 23 12

can do this as long as you have more logical shards than physical

ones

Monday April 23 12

beauty of schemas is that they are physically

different files

Monday April 23 12

(no IO hit when deleting no lsquoswiss cheesersquo)

Monday April 23 12

downside requires ~30 seconds of maintenance to roll out new schema

mapping

Monday April 23 12

(could be solved by having concept of ldquoread-

onlyrdquo mode for some DBs)

Monday April 23 12

not great for range-scans that would span across

shards

Monday April 23 12

latest project follow graph

Monday April 23 12

v1 simple DB table(source_id target_id

status)

Monday April 23 12

who do I followwho follows me

do I follow Xdoes X follow me

Monday April 23 12

DB was busy so we started storing parallel

version in Redis

Monday April 23 12

follow_all(300 item list)

Monday April 23 12

inconsistency

Monday April 23 12

extra logic

Monday April 23 12

so much extra logic

Monday April 23 12

exposing your support team to the idea of cache invalidation

Monday April 23 12

Monday April 23 12

redesign took a page from twitterrsquos book

Monday April 23 12

PG can handle tens of thousands of requests very light memcached

caching

Monday April 23 12

next steps

Monday April 23 12

isolating services to minimize open conns

Monday April 23 12

investigate physical hardware etc to reduce

need to re-shard

Monday April 23 12

Wrap up

Monday April 23 12

you donrsquot need to give up PGrsquos durability amp features to shard

Monday April 23 12

continue to let the DB do what the DB is great at

Monday April 23 12

ldquodonrsquot shard until you have tordquo

Monday April 23 12

(but donrsquot over-estimate how hard it will be either)

Monday April 23 12

scaled within constraints of the cloud

Monday April 23 12

PG success story

Monday April 23 12

(wersquore really excited about 92)

Monday April 23 12

thanks any qs

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineB shard4 photos_by_user shard5 photos_by_user shard6 photos_by_user shard7 photos_by_user

Monday April 23 12

(fabric or similar parallel task executor is

essential)

Monday April 23 12

application-side logic

Monday April 23 12

SHARD_TO_DB =

SHARD_TO_DB[0] = 0SHARD_TO_DB[1] = 0SHARD_TO_DB[2] = 0SHARD_TO_DB[3] = 0SHARD_TO_DB[4] = 1SHARD_TO_DB[5] = 1SHARD_TO_DB[6] = 1SHARD_TO_DB[7] = 1

Monday April 23 12

instead of Django ORM wrote really simple db

abstraction layer

Monday April 23 12

selectupdateinsertdelete

Monday April 23 12

select(fields table_name shard_key where_statement where_parameters)

Monday April 23 12

select(fields table_name shard_key where_statement where_parameters)

shard_key num_logical_shards = shard_id

Monday April 23 12

in most cases user_id for us

Monday April 23 12

custom Django test runner to createtear-down sharded DBs

Monday April 23 12

most queries involve visiting handful of shards

over one or two machines

Monday April 23 12

if mapping across shards on single DB UNION

ALL to aggregate

Monday April 23 12

clients to library pass in((shard_key id)

(shard_keyid)) etc

Monday April 23 12

library maps sub-selects to each shard and each

machine

Monday April 23 12

parallel execution (per-machine at least)

Monday April 23 12

-gt Append (cost=00097372 rows=100 width=12) (actual time=0290160035 rows=30 loops=1) -gt Limit (cost=00080624 rows=30 width=12) (actual time=0288159913 rows=14 loops=1) -gt Index Scan Backward using index on table (cost=0001865104 rows=694 width=12) (actual time=0286159885 rows=14 loops=1) -gt Limit (cost=0007115 rows=30 width=12) (actual time=00150018 rows=1 loops=1) -gt Index Scan using index on table (cost=00010199 rows=43 width=12) (actual time=00130014 rows=1 loops=1) (etc)

Monday April 23 12

eventually would be nice to parallelize across

machines

Monday April 23 12

next challenge unique IDs

Monday April 23 12

requirements

Monday April 23 12

1 should be time sortable without requiring

a lookup

Monday April 23 12

2 should be 64-bit

Monday April 23 12

3 low operational complexity

Monday April 23 12

surveyed the options

Monday April 23 12

ticket servers

Monday April 23 12

UUID

Monday April 23 12

twitter snowflake

Monday April 23 12

application-level IDs ala Mongo

Monday April 23 12

hey the db is already pretty good about

incrementing sequences

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

CREATE OR REPLACE FUNCTION insta5next_id(OUT result bigint) AS $$DECLARE our_epoch bigint = 1314220021721 seq_id bigint now_millis bigint shard_id int = 5BEGIN SELECT nextval(insta5table_id_seq) 1024 INTO seq_id

SELECT FLOOR(EXTRACT(EPOCH FROM clock_timestamp()) 1000) INTO now_millis result = (now_millis - our_epoch) ltlt 23 result = result | (shard_id ltlt 10) result = result | (seq_id)END$$ LANGUAGE PLPGSQL

Monday April 23 12

pulling shard ID from ID

shard_id = id ^ ((id gtgt 23) ltlt 23)timestamp = EPOCH + id gtgt 23

Monday April 23 12

pros guaranteed unique in 64-bits not much of a

CPU overhead

Monday April 23 12

cons large IDs from the get-go

Monday April 23 12

hundreds of millions of IDs generated with this

scheme no issues

Monday April 23 12

well what about ldquore-shardingrdquo

Monday April 23 12

first recourse pg_reorg

Monday April 23 12

rewrites tables in index order

Monday April 23 12

only requires brief locks for atomic table renames

Monday April 23 12

20+GB savings on some of our dbs

Monday April 23 12

Monday April 23 12

especially useful on EC2

Monday April 23 12

but sometimes you just have to reshard

Monday April 23 12

streaming replication to the rescue

Monday April 23 12

(btw repmgr is awesome)

Monday April 23 12

repmgr standby clone ltmastergt

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineArsquo shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineC shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

Monday April 23 12

PGBouncer abstracts moving DBs from the

app logic

Monday April 23 12

can do this as long as you have more logical shards than physical

ones

Monday April 23 12

beauty of schemas is that they are physically

different files

Monday April 23 12

(no IO hit when deleting no lsquoswiss cheesersquo)

Monday April 23 12

downside requires ~30 seconds of maintenance to roll out new schema

mapping

Monday April 23 12

(could be solved by having concept of ldquoread-

onlyrdquo mode for some DBs)

Monday April 23 12

not great for range-scans that would span across

shards

Monday April 23 12

latest project follow graph

Monday April 23 12

v1 simple DB table(source_id target_id

status)

Monday April 23 12

who do I followwho follows me

do I follow Xdoes X follow me

Monday April 23 12

DB was busy so we started storing parallel

version in Redis

Monday April 23 12

follow_all(300 item list)

Monday April 23 12

inconsistency

Monday April 23 12

extra logic

Monday April 23 12

so much extra logic

Monday April 23 12

exposing your support team to the idea of cache invalidation

Monday April 23 12

Monday April 23 12

redesign took a page from twitterrsquos book

Monday April 23 12

PG can handle tens of thousands of requests very light memcached

caching

Monday April 23 12

next steps

Monday April 23 12

isolating services to minimize open conns

Monday April 23 12

investigate physical hardware etc to reduce

need to re-shard

Monday April 23 12

Wrap up

Monday April 23 12

you donrsquot need to give up PGrsquos durability amp features to shard

Monday April 23 12

continue to let the DB do what the DB is great at

Monday April 23 12

ldquodonrsquot shard until you have tordquo

Monday April 23 12

(but donrsquot over-estimate how hard it will be either)

Monday April 23 12

scaled within constraints of the cloud

Monday April 23 12

PG success story

Monday April 23 12

(wersquore really excited about 92)

Monday April 23 12

thanks any qs

Monday April 23 12

(fabric or similar parallel task executor is

essential)

Monday April 23 12

application-side logic

Monday April 23 12

SHARD_TO_DB =

SHARD_TO_DB[0] = 0SHARD_TO_DB[1] = 0SHARD_TO_DB[2] = 0SHARD_TO_DB[3] = 0SHARD_TO_DB[4] = 1SHARD_TO_DB[5] = 1SHARD_TO_DB[6] = 1SHARD_TO_DB[7] = 1

Monday April 23 12

instead of Django ORM wrote really simple db

abstraction layer

Monday April 23 12

selectupdateinsertdelete

Monday April 23 12

select(fields table_name shard_key where_statement where_parameters)

Monday April 23 12

select(fields table_name shard_key where_statement where_parameters)

shard_key num_logical_shards = shard_id

Monday April 23 12

in most cases user_id for us

Monday April 23 12

custom Django test runner to createtear-down sharded DBs

Monday April 23 12

most queries involve visiting handful of shards

over one or two machines

Monday April 23 12

if mapping across shards on single DB UNION

ALL to aggregate

Monday April 23 12

clients to library pass in((shard_key id)

(shard_keyid)) etc

Monday April 23 12

library maps sub-selects to each shard and each

machine

Monday April 23 12

parallel execution (per-machine at least)

Monday April 23 12

-gt Append (cost=00097372 rows=100 width=12) (actual time=0290160035 rows=30 loops=1) -gt Limit (cost=00080624 rows=30 width=12) (actual time=0288159913 rows=14 loops=1) -gt Index Scan Backward using index on table (cost=0001865104 rows=694 width=12) (actual time=0286159885 rows=14 loops=1) -gt Limit (cost=0007115 rows=30 width=12) (actual time=00150018 rows=1 loops=1) -gt Index Scan using index on table (cost=00010199 rows=43 width=12) (actual time=00130014 rows=1 loops=1) (etc)

Monday April 23 12

eventually would be nice to parallelize across

machines

Monday April 23 12

next challenge unique IDs

Monday April 23 12

requirements

Monday April 23 12

1 should be time sortable without requiring

a lookup

Monday April 23 12

2 should be 64-bit

Monday April 23 12

3 low operational complexity

Monday April 23 12

surveyed the options

Monday April 23 12

ticket servers

Monday April 23 12

UUID

Monday April 23 12

twitter snowflake

Monday April 23 12

application-level IDs ala Mongo

Monday April 23 12

hey the db is already pretty good about

incrementing sequences

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

CREATE OR REPLACE FUNCTION insta5next_id(OUT result bigint) AS $$DECLARE our_epoch bigint = 1314220021721 seq_id bigint now_millis bigint shard_id int = 5BEGIN SELECT nextval(insta5table_id_seq) 1024 INTO seq_id

SELECT FLOOR(EXTRACT(EPOCH FROM clock_timestamp()) 1000) INTO now_millis result = (now_millis - our_epoch) ltlt 23 result = result | (shard_id ltlt 10) result = result | (seq_id)END$$ LANGUAGE PLPGSQL

Monday April 23 12

pulling shard ID from ID

shard_id = id ^ ((id gtgt 23) ltlt 23)timestamp = EPOCH + id gtgt 23

Monday April 23 12

pros guaranteed unique in 64-bits not much of a

CPU overhead

Monday April 23 12

cons large IDs from the get-go

Monday April 23 12

hundreds of millions of IDs generated with this

scheme no issues

Monday April 23 12

well what about ldquore-shardingrdquo

Monday April 23 12

first recourse pg_reorg

Monday April 23 12

rewrites tables in index order

Monday April 23 12

only requires brief locks for atomic table renames

Monday April 23 12

20+GB savings on some of our dbs

Monday April 23 12

Monday April 23 12

especially useful on EC2

Monday April 23 12

but sometimes you just have to reshard

Monday April 23 12

streaming replication to the rescue

Monday April 23 12

(btw repmgr is awesome)

Monday April 23 12

repmgr standby clone ltmastergt

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineArsquo shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineC shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

Monday April 23 12

PGBouncer abstracts moving DBs from the

app logic

Monday April 23 12

can do this as long as you have more logical shards than physical

ones

Monday April 23 12

beauty of schemas is that they are physically

different files

Monday April 23 12

(no IO hit when deleting no lsquoswiss cheesersquo)

Monday April 23 12

downside requires ~30 seconds of maintenance to roll out new schema

mapping

Monday April 23 12

(could be solved by having concept of ldquoread-

onlyrdquo mode for some DBs)

Monday April 23 12

not great for range-scans that would span across

shards

Monday April 23 12

latest project follow graph

Monday April 23 12

v1 simple DB table(source_id target_id

status)

Monday April 23 12

who do I followwho follows me

do I follow Xdoes X follow me

Monday April 23 12

DB was busy so we started storing parallel

version in Redis

Monday April 23 12

follow_all(300 item list)

Monday April 23 12

inconsistency

Monday April 23 12

extra logic

Monday April 23 12

so much extra logic

Monday April 23 12

exposing your support team to the idea of cache invalidation

Monday April 23 12

Monday April 23 12

redesign took a page from twitterrsquos book

Monday April 23 12

PG can handle tens of thousands of requests very light memcached

caching

Monday April 23 12

next steps

Monday April 23 12

isolating services to minimize open conns

Monday April 23 12

investigate physical hardware etc to reduce

need to re-shard

Monday April 23 12

Wrap up

Monday April 23 12

you donrsquot need to give up PGrsquos durability amp features to shard

Monday April 23 12

continue to let the DB do what the DB is great at

Monday April 23 12

ldquodonrsquot shard until you have tordquo

Monday April 23 12

(but donrsquot over-estimate how hard it will be either)

Monday April 23 12

scaled within constraints of the cloud

Monday April 23 12

PG success story

Monday April 23 12

(wersquore really excited about 92)

Monday April 23 12

thanks any qs

Monday April 23 12

application-side logic

Monday April 23 12

SHARD_TO_DB =

SHARD_TO_DB[0] = 0SHARD_TO_DB[1] = 0SHARD_TO_DB[2] = 0SHARD_TO_DB[3] = 0SHARD_TO_DB[4] = 1SHARD_TO_DB[5] = 1SHARD_TO_DB[6] = 1SHARD_TO_DB[7] = 1

Monday April 23 12

instead of Django ORM wrote really simple db

abstraction layer

Monday April 23 12

selectupdateinsertdelete

Monday April 23 12

select(fields table_name shard_key where_statement where_parameters)

Monday April 23 12

select(fields table_name shard_key where_statement where_parameters)

shard_key num_logical_shards = shard_id

Monday April 23 12

in most cases user_id for us

Monday April 23 12

custom Django test runner to createtear-down sharded DBs

Monday April 23 12

most queries involve visiting handful of shards

over one or two machines

Monday April 23 12

if mapping across shards on single DB UNION

ALL to aggregate

Monday April 23 12

clients to library pass in((shard_key id)

(shard_keyid)) etc

Monday April 23 12

library maps sub-selects to each shard and each

machine

Monday April 23 12

parallel execution (per-machine at least)

Monday April 23 12

-gt Append (cost=00097372 rows=100 width=12) (actual time=0290160035 rows=30 loops=1) -gt Limit (cost=00080624 rows=30 width=12) (actual time=0288159913 rows=14 loops=1) -gt Index Scan Backward using index on table (cost=0001865104 rows=694 width=12) (actual time=0286159885 rows=14 loops=1) -gt Limit (cost=0007115 rows=30 width=12) (actual time=00150018 rows=1 loops=1) -gt Index Scan using index on table (cost=00010199 rows=43 width=12) (actual time=00130014 rows=1 loops=1) (etc)

Monday April 23 12

eventually would be nice to parallelize across

machines

Monday April 23 12

next challenge unique IDs

Monday April 23 12

requirements

Monday April 23 12

1 should be time sortable without requiring

a lookup

Monday April 23 12

2 should be 64-bit

Monday April 23 12

3 low operational complexity

Monday April 23 12

surveyed the options

Monday April 23 12

ticket servers

Monday April 23 12

UUID

Monday April 23 12

twitter snowflake

Monday April 23 12

application-level IDs ala Mongo

Monday April 23 12

hey the db is already pretty good about

incrementing sequences

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

CREATE OR REPLACE FUNCTION insta5next_id(OUT result bigint) AS $$DECLARE our_epoch bigint = 1314220021721 seq_id bigint now_millis bigint shard_id int = 5BEGIN SELECT nextval(insta5table_id_seq) 1024 INTO seq_id

SELECT FLOOR(EXTRACT(EPOCH FROM clock_timestamp()) 1000) INTO now_millis result = (now_millis - our_epoch) ltlt 23 result = result | (shard_id ltlt 10) result = result | (seq_id)END$$ LANGUAGE PLPGSQL

Monday April 23 12

pulling shard ID from ID

shard_id = id ^ ((id gtgt 23) ltlt 23)timestamp = EPOCH + id gtgt 23

Monday April 23 12

pros guaranteed unique in 64-bits not much of a

CPU overhead

Monday April 23 12

cons large IDs from the get-go

Monday April 23 12

hundreds of millions of IDs generated with this

scheme no issues

Monday April 23 12

well what about ldquore-shardingrdquo

Monday April 23 12

first recourse pg_reorg

Monday April 23 12

rewrites tables in index order

Monday April 23 12

only requires brief locks for atomic table renames

Monday April 23 12

20+GB savings on some of our dbs

Monday April 23 12

Monday April 23 12

especially useful on EC2

Monday April 23 12

but sometimes you just have to reshard

Monday April 23 12

streaming replication to the rescue

Monday April 23 12

(btw repmgr is awesome)

Monday April 23 12

repmgr standby clone ltmastergt

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineArsquo shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineC shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

Monday April 23 12

PGBouncer abstracts moving DBs from the

app logic

Monday April 23 12

can do this as long as you have more logical shards than physical

ones

Monday April 23 12

beauty of schemas is that they are physically

different files

Monday April 23 12

(no IO hit when deleting no lsquoswiss cheesersquo)

Monday April 23 12

downside requires ~30 seconds of maintenance to roll out new schema

mapping

Monday April 23 12

(could be solved by having concept of ldquoread-

onlyrdquo mode for some DBs)

Monday April 23 12

not great for range-scans that would span across

shards

Monday April 23 12

latest project follow graph

Monday April 23 12

v1 simple DB table(source_id target_id

status)

Monday April 23 12

who do I followwho follows me

do I follow Xdoes X follow me

Monday April 23 12

DB was busy so we started storing parallel

version in Redis

Monday April 23 12

follow_all(300 item list)

Monday April 23 12

inconsistency

Monday April 23 12

extra logic

Monday April 23 12

so much extra logic

Monday April 23 12

exposing your support team to the idea of cache invalidation

Monday April 23 12

Monday April 23 12

redesign took a page from twitterrsquos book

Monday April 23 12

PG can handle tens of thousands of requests very light memcached

caching

Monday April 23 12

next steps

Monday April 23 12

isolating services to minimize open conns

Monday April 23 12

investigate physical hardware etc to reduce

need to re-shard

Monday April 23 12

Wrap up

Monday April 23 12

you donrsquot need to give up PGrsquos durability amp features to shard

Monday April 23 12

continue to let the DB do what the DB is great at

Monday April 23 12

ldquodonrsquot shard until you have tordquo

Monday April 23 12

(but donrsquot over-estimate how hard it will be either)

Monday April 23 12

scaled within constraints of the cloud

Monday April 23 12

PG success story

Monday April 23 12

(wersquore really excited about 92)

Monday April 23 12

thanks any qs

Monday April 23 12

SHARD_TO_DB =

SHARD_TO_DB[0] = 0SHARD_TO_DB[1] = 0SHARD_TO_DB[2] = 0SHARD_TO_DB[3] = 0SHARD_TO_DB[4] = 1SHARD_TO_DB[5] = 1SHARD_TO_DB[6] = 1SHARD_TO_DB[7] = 1

Monday April 23 12

instead of Django ORM wrote really simple db

abstraction layer

Monday April 23 12

selectupdateinsertdelete

Monday April 23 12

select(fields table_name shard_key where_statement where_parameters)

Monday April 23 12

select(fields table_name shard_key where_statement where_parameters)

shard_key num_logical_shards = shard_id

Monday April 23 12

in most cases user_id for us

Monday April 23 12

custom Django test runner to createtear-down sharded DBs

Monday April 23 12

most queries involve visiting handful of shards

over one or two machines

Monday April 23 12

if mapping across shards on single DB UNION

ALL to aggregate

Monday April 23 12

clients to library pass in((shard_key id)

(shard_keyid)) etc

Monday April 23 12

library maps sub-selects to each shard and each

machine

Monday April 23 12

parallel execution (per-machine at least)

Monday April 23 12

-gt Append (cost=00097372 rows=100 width=12) (actual time=0290160035 rows=30 loops=1) -gt Limit (cost=00080624 rows=30 width=12) (actual time=0288159913 rows=14 loops=1) -gt Index Scan Backward using index on table (cost=0001865104 rows=694 width=12) (actual time=0286159885 rows=14 loops=1) -gt Limit (cost=0007115 rows=30 width=12) (actual time=00150018 rows=1 loops=1) -gt Index Scan using index on table (cost=00010199 rows=43 width=12) (actual time=00130014 rows=1 loops=1) (etc)

Monday April 23 12

eventually would be nice to parallelize across

machines

Monday April 23 12

next challenge unique IDs

Monday April 23 12

requirements

Monday April 23 12

1 should be time sortable without requiring

a lookup

Monday April 23 12

2 should be 64-bit

Monday April 23 12

3 low operational complexity

Monday April 23 12

surveyed the options

Monday April 23 12

ticket servers

Monday April 23 12

UUID

Monday April 23 12

twitter snowflake

Monday April 23 12

application-level IDs ala Mongo

Monday April 23 12

hey the db is already pretty good about

incrementing sequences

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

CREATE OR REPLACE FUNCTION insta5next_id(OUT result bigint) AS $$DECLARE our_epoch bigint = 1314220021721 seq_id bigint now_millis bigint shard_id int = 5BEGIN SELECT nextval(insta5table_id_seq) 1024 INTO seq_id

SELECT FLOOR(EXTRACT(EPOCH FROM clock_timestamp()) 1000) INTO now_millis result = (now_millis - our_epoch) ltlt 23 result = result | (shard_id ltlt 10) result = result | (seq_id)END$$ LANGUAGE PLPGSQL

Monday April 23 12

pulling shard ID from ID

shard_id = id ^ ((id >> 23) << 23)
timestamp = EPOCH + (id >> 23)

Monday April 23 12
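A quick Python sketch (mirroring the bit layout above) for unpacking such an ID; EPOCH is the custom epoch in milliseconds from the function above.

EPOCH = 1314220021721   # custom epoch in millis (our_epoch above)

def unpack_id(id_):
    seq_id = id_ & ((1 << 10) - 1)             # low 10 bits: per-shard sequence
    shard_id = (id_ >> 10) & ((1 << 13) - 1)   # next 13 bits: logical shard ID
    millis = EPOCH + (id_ >> 23)               # top 41 bits: millis since custom epoch
    return millis, shard_id, seq_id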

pros: guaranteed unique in 64 bits, not much of a CPU overhead

Monday April 23 12

cons: large IDs from the get-go

Monday April 23 12

hundreds of millions of IDs generated with this scheme, no issues

Monday April 23 12

well, what about “re-sharding”?

Monday April 23 12

first recourse: pg_reorg

Monday April 23 12

rewrites tables in index order

Monday April 23 12

only requires brief locks for atomic table renames

Monday April 23 12

20+GB savings on some of our dbs

Monday April 23 12


especially useful on EC2

Monday April 23 12

but sometimes you just have to reshard

Monday April 23 12

streaming replication to the rescue

Monday April 23 12

(btw repmgr is awesome)

Monday April 23 12

repmgr standby clone <master>

Monday April 23 12

machineA: shard0.photos_by_user, shard1.photos_by_user, shard2.photos_by_user, shard3.photos_by_user

machineA’: shard0.photos_by_user, shard1.photos_by_user, shard2.photos_by_user, shard3.photos_by_user

Monday April 23 12

machineA: shard0.photos_by_user, shard1.photos_by_user, shard2.photos_by_user, shard3.photos_by_user

machineC: shard0.photos_by_user, shard1.photos_by_user, shard2.photos_by_user, shard3.photos_by_user

Monday April 23 12

PGBouncer abstracts moving DBs from the app logic

Monday April 23 12

can do this as long as you have more logical shards than physical ones

Monday April 23 12
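To make that concrete (a sketch; the machine identifiers and helper are made up): once the promoted replica owns its shards, re-sharding is just an update to the logical-to-physical mapping, plus the matching PGBouncer entry, while shard IDs stay stable.

# hypothetical re-sharding step: shards 2 and 3 move from machineA to machineC
MACHINE_A, MACHINE_C = 0, 2   # physical DB identifiers, invented for illustration

SHARD_TO_DB = {0: MACHINE_A, 1: MACHINE_A, 2: MACHINE_A, 3: MACHINE_A}

def move_shards_to(mapping, shard_ids, new_db):
    for shard_id in shard_ids:
        mapping[shard_id] = new_db
    return mapping

SHARD_TO_DB = move_shards_to(SHARD_TO_DB, [2, 3], MACHINE_C)
# SHARD_TO_DB is now {0: 0, 1: 0, 2: 2, 3: 2}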

beauty of schemas is that they are physically different files

Monday April 23 12
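A sketch of the schema-per-logical-shard layout implied here; the table columns are placeholders, not the production definition.

# hypothetical bootstrap: one schema per logical shard, same table in each
def create_logical_shards(cursor, num_logical_shards):
    for shard_id in range(num_logical_shards):
        cursor.execute("CREATE SCHEMA insta%d" % shard_id)
        cursor.execute("""
            CREATE TABLE insta%d.photos_by_user (
                id bigint NOT NULL,
                user_id bigint NOT NULL,
                media_id bigint NOT NULL,
                PRIMARY KEY (id)
            )
        """ % shard_id)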

(no I/O hit when deleting, no ‘swiss cheese’)

Monday April 23 12

downside: requires ~30 seconds of maintenance to roll out new schema mapping

Monday April 23 12

(could be solved by having a concept of “read-only” mode for some DBs)

Monday April 23 12

not great for range-scans that would span across shards

Monday April 23 12

latest project: follow graph

Monday April 23 12

v1: simple DB table (source_id, target_id, status)

Monday April 23 12

who do I follow? who follows me?

do I follow X? does X follow me?

Monday April 23 12
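For concreteness, the lookups above against that v1 table might look like this; a sketch only, where the table name followers and the status value are assumptions.

# hypothetical v1 follow-graph lookups (columns follow the slide)
def who_do_i_follow(cursor, user_id):
    cursor.execute("SELECT target_id FROM followers WHERE source_id = %s AND status = %s",
                   (user_id, "active"))
    return [row[0] for row in cursor.fetchall()]

def who_follows_me(cursor, user_id):
    cursor.execute("SELECT source_id FROM followers WHERE target_id = %s AND status = %s",
                   (user_id, "active"))
    return [row[0] for row in cursor.fetchall()]

def do_i_follow(cursor, user_id, x):
    cursor.execute("SELECT 1 FROM followers WHERE source_id = %s AND target_id = %s AND status = %s",
                   (user_id, x, "active"))
    return cursor.fetchone() is not None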

DB was busy, so we started storing a parallel version in Redis

Monday April 23 12

follow_all(300 item list)

Monday April 23 12

inconsistency

Monday April 23 12

extra logic

Monday April 23 12

so much extra logic

Monday April 23 12

exposing your support team to the idea of cache invalidation

Monday April 23 12


redesign took a page from twitter’s book

Monday April 23 12

PG can handle tens of thousands of requests, very light memcached caching

Monday April 23 12
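As an illustration of that “very light” caching (not Instagram's actual code; the key format and TTL are invented), a read-through memcache in front of one follow-graph check, using python-memcached-style get/set:

# hypothetical read-through cache in front of the follow-graph lookup
def do_i_follow_cached(mc, cursor, user_id, x, ttl=300):
    key = "follows:%d:%d" % (user_id, x)
    cached = mc.get(key)
    if cached is not None:
        return cached == "1"
    cursor.execute("SELECT 1 FROM followers WHERE source_id = %s AND target_id = %s AND status = %s",
                   (user_id, x, "active"))
    result = cursor.fetchone() is not None
    mc.set(key, "1" if result else "0", ttl)   # cache the answer briefly
    return result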

next steps

Monday April 23 12

isolating services to minimize open conns

Monday April 23 12

investigate physical hardware etc. to reduce need to re-shard

Monday April 23 12

Wrap up

Monday April 23 12

you don’t need to give up PG’s durability & features to shard

Monday April 23 12

continue to let the DB do what the DB is great at

Monday April 23 12

“don’t shard until you have to”

Monday April 23 12

(but don’t over-estimate how hard it will be either)

Monday April 23 12

scaled within constraints of the cloud

Monday April 23 12

PG success story

Monday April 23 12

(we’re really excited about 9.2)

Monday April 23 12

thanks! any qs?

Monday April 23 12


Monday April 23 12

PG can handle tens of thousands of requests very light memcached

caching

Monday April 23 12

next steps

Monday April 23 12

isolating services to minimize open conns

Monday April 23 12

investigate physical hardware etc to reduce

need to re-shard

Monday April 23 12

Wrap up

Monday April 23 12

you donrsquot need to give up PGrsquos durability amp features to shard

Monday April 23 12

continue to let the DB do what the DB is great at

Monday April 23 12

ldquodonrsquot shard until you have tordquo

Monday April 23 12

(but donrsquot over-estimate how hard it will be either)

Monday April 23 12

scaled within constraints of the cloud

Monday April 23 12

PG success story

Monday April 23 12

(wersquore really excited about 92)

Monday April 23 12

thanks any qs

Monday April 23 12

parallel execution (per-machine at least)

Monday April 23 12

-gt Append (cost=00097372 rows=100 width=12) (actual time=0290160035 rows=30 loops=1) -gt Limit (cost=00080624 rows=30 width=12) (actual time=0288159913 rows=14 loops=1) -gt Index Scan Backward using index on table (cost=0001865104 rows=694 width=12) (actual time=0286159885 rows=14 loops=1) -gt Limit (cost=0007115 rows=30 width=12) (actual time=00150018 rows=1 loops=1) -gt Index Scan using index on table (cost=00010199 rows=43 width=12) (actual time=00130014 rows=1 loops=1) (etc)

Monday April 23 12

eventually would be nice to parallelize across

machines

Monday April 23 12

next challenge unique IDs

Monday April 23 12

requirements

Monday April 23 12

1 should be time sortable without requiring

a lookup

Monday April 23 12

2 should be 64-bit

Monday April 23 12

3 low operational complexity

Monday April 23 12

surveyed the options

Monday April 23 12

ticket servers

Monday April 23 12

UUID

Monday April 23 12

twitter snowflake

Monday April 23 12

application-level IDs ala Mongo

Monday April 23 12

hey the db is already pretty good about

incrementing sequences

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

CREATE OR REPLACE FUNCTION insta5next_id(OUT result bigint) AS $$DECLARE our_epoch bigint = 1314220021721 seq_id bigint now_millis bigint shard_id int = 5BEGIN SELECT nextval(insta5table_id_seq) 1024 INTO seq_id

SELECT FLOOR(EXTRACT(EPOCH FROM clock_timestamp()) 1000) INTO now_millis result = (now_millis - our_epoch) ltlt 23 result = result | (shard_id ltlt 10) result = result | (seq_id)END$$ LANGUAGE PLPGSQL

Monday April 23 12

pulling shard ID from ID

shard_id = id ^ ((id gtgt 23) ltlt 23)timestamp = EPOCH + id gtgt 23

Monday April 23 12

pros guaranteed unique in 64-bits not much of a

CPU overhead

Monday April 23 12

cons large IDs from the get-go

Monday April 23 12

hundreds of millions of IDs generated with this

scheme no issues

Monday April 23 12

well what about ldquore-shardingrdquo

Monday April 23 12

first recourse pg_reorg

Monday April 23 12

rewrites tables in index order

Monday April 23 12

only requires brief locks for atomic table renames

Monday April 23 12

20+GB savings on some of our dbs

Monday April 23 12

Monday April 23 12

especially useful on EC2

Monday April 23 12

but sometimes you just have to reshard

Monday April 23 12

streaming replication to the rescue

Monday April 23 12

(btw repmgr is awesome)

Monday April 23 12

repmgr standby clone ltmastergt

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineArsquo shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineC shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

Monday April 23 12

PGBouncer abstracts moving DBs from the

app logic

Monday April 23 12

can do this as long as you have more logical shards than physical

ones

Monday April 23 12

beauty of schemas is that they are physically

different files

Monday April 23 12

(no IO hit when deleting no lsquoswiss cheesersquo)

Monday April 23 12

downside requires ~30 seconds of maintenance to roll out new schema

mapping

Monday April 23 12

(could be solved by having concept of ldquoread-

onlyrdquo mode for some DBs)

Monday April 23 12

not great for range-scans that would span across

shards

Monday April 23 12

latest project follow graph

Monday April 23 12

v1 simple DB table(source_id target_id

status)

Monday April 23 12

who do I followwho follows me

do I follow Xdoes X follow me

Monday April 23 12

DB was busy so we started storing parallel

version in Redis

Monday April 23 12

follow_all(300 item list)

Monday April 23 12

inconsistency

Monday April 23 12

extra logic

Monday April 23 12

so much extra logic

Monday April 23 12

exposing your support team to the idea of cache invalidation

Monday April 23 12

Monday April 23 12

redesign took a page from twitterrsquos book

Monday April 23 12

PG can handle tens of thousands of requests very light memcached

caching

Monday April 23 12

next steps

Monday April 23 12

isolating services to minimize open conns

Monday April 23 12

investigate physical hardware etc to reduce

need to re-shard

Monday April 23 12

Wrap up

Monday April 23 12

you donrsquot need to give up PGrsquos durability amp features to shard

Monday April 23 12

continue to let the DB do what the DB is great at

Monday April 23 12

ldquodonrsquot shard until you have tordquo

Monday April 23 12

(but donrsquot over-estimate how hard it will be either)

Monday April 23 12

scaled within constraints of the cloud

Monday April 23 12

PG success story

Monday April 23 12

(wersquore really excited about 92)

Monday April 23 12

thanks any qs

Monday April 23 12

-gt Append (cost=00097372 rows=100 width=12) (actual time=0290160035 rows=30 loops=1) -gt Limit (cost=00080624 rows=30 width=12) (actual time=0288159913 rows=14 loops=1) -gt Index Scan Backward using index on table (cost=0001865104 rows=694 width=12) (actual time=0286159885 rows=14 loops=1) -gt Limit (cost=0007115 rows=30 width=12) (actual time=00150018 rows=1 loops=1) -gt Index Scan using index on table (cost=00010199 rows=43 width=12) (actual time=00130014 rows=1 loops=1) (etc)

Monday April 23 12

eventually would be nice to parallelize across

machines

Monday April 23 12

next challenge unique IDs

Monday April 23 12

requirements

Monday April 23 12

1 should be time sortable without requiring

a lookup

Monday April 23 12

2 should be 64-bit

Monday April 23 12

3 low operational complexity

Monday April 23 12

surveyed the options

Monday April 23 12

ticket servers

Monday April 23 12

UUID

Monday April 23 12

twitter snowflake

Monday April 23 12

application-level IDs ala Mongo

Monday April 23 12

hey the db is already pretty good about

incrementing sequences

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

CREATE OR REPLACE FUNCTION insta5next_id(OUT result bigint) AS $$DECLARE our_epoch bigint = 1314220021721 seq_id bigint now_millis bigint shard_id int = 5BEGIN SELECT nextval(insta5table_id_seq) 1024 INTO seq_id

SELECT FLOOR(EXTRACT(EPOCH FROM clock_timestamp()) 1000) INTO now_millis result = (now_millis - our_epoch) ltlt 23 result = result | (shard_id ltlt 10) result = result | (seq_id)END$$ LANGUAGE PLPGSQL

Monday April 23 12

pulling shard ID from ID

shard_id = id ^ ((id gtgt 23) ltlt 23)timestamp = EPOCH + id gtgt 23

Monday April 23 12

pros guaranteed unique in 64-bits not much of a

CPU overhead

Monday April 23 12

cons large IDs from the get-go

Monday April 23 12

hundreds of millions of IDs generated with this

scheme no issues

Monday April 23 12

well what about ldquore-shardingrdquo

Monday April 23 12

first recourse pg_reorg

Monday April 23 12

rewrites tables in index order

Monday April 23 12

only requires brief locks for atomic table renames

Monday April 23 12

20+GB savings on some of our dbs

Monday April 23 12

Monday April 23 12

especially useful on EC2

Monday April 23 12

but sometimes you just have to reshard

Monday April 23 12

streaming replication to the rescue

Monday April 23 12

(btw repmgr is awesome)

Monday April 23 12

repmgr standby clone ltmastergt

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineArsquo shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineC shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

Monday April 23 12

PGBouncer abstracts moving DBs from the

app logic

Monday April 23 12

can do this as long as you have more logical shards than physical

ones

Monday April 23 12

beauty of schemas is that they are physically

different files

Monday April 23 12

(no IO hit when deleting no lsquoswiss cheesersquo)

Monday April 23 12

downside requires ~30 seconds of maintenance to roll out new schema

mapping

Monday April 23 12

(could be solved by having concept of ldquoread-

onlyrdquo mode for some DBs)

Monday April 23 12

not great for range-scans that would span across

shards

Monday April 23 12

latest project follow graph

Monday April 23 12

v1 simple DB table(source_id target_id

status)

Monday April 23 12

who do I followwho follows me

do I follow Xdoes X follow me

Monday April 23 12

DB was busy so we started storing parallel

version in Redis

Monday April 23 12

follow_all(300 item list)

Monday April 23 12

inconsistency

Monday April 23 12

extra logic

Monday April 23 12

so much extra logic

Monday April 23 12

exposing your support team to the idea of cache invalidation

Monday April 23 12

Monday April 23 12

redesign took a page from twitterrsquos book

Monday April 23 12

PG can handle tens of thousands of requests very light memcached

caching

Monday April 23 12

next steps

Monday April 23 12

isolating services to minimize open conns

Monday April 23 12

investigate physical hardware etc to reduce

need to re-shard

Monday April 23 12

Wrap up

Monday April 23 12

you donrsquot need to give up PGrsquos durability amp features to shard

Monday April 23 12

continue to let the DB do what the DB is great at

Monday April 23 12

ldquodonrsquot shard until you have tordquo

Monday April 23 12

(but donrsquot over-estimate how hard it will be either)

Monday April 23 12

scaled within constraints of the cloud

Monday April 23 12

PG success story

Monday April 23 12

(wersquore really excited about 92)

Monday April 23 12

thanks any qs

Monday April 23 12

eventually would be nice to parallelize across

machines

Monday April 23 12

next challenge unique IDs

Monday April 23 12

requirements

Monday April 23 12

1 should be time sortable without requiring

a lookup

Monday April 23 12

2 should be 64-bit

Monday April 23 12

3 low operational complexity

Monday April 23 12

surveyed the options

Monday April 23 12

ticket servers

Monday April 23 12

UUID

Monday April 23 12

twitter snowflake

Monday April 23 12

application-level IDs ala Mongo

Monday April 23 12

hey the db is already pretty good about

incrementing sequences

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

CREATE OR REPLACE FUNCTION insta5next_id(OUT result bigint) AS $$DECLARE our_epoch bigint = 1314220021721 seq_id bigint now_millis bigint shard_id int = 5BEGIN SELECT nextval(insta5table_id_seq) 1024 INTO seq_id

SELECT FLOOR(EXTRACT(EPOCH FROM clock_timestamp()) 1000) INTO now_millis result = (now_millis - our_epoch) ltlt 23 result = result | (shard_id ltlt 10) result = result | (seq_id)END$$ LANGUAGE PLPGSQL

Monday April 23 12

pulling shard ID from ID

shard_id = id ^ ((id gtgt 23) ltlt 23)timestamp = EPOCH + id gtgt 23

Monday April 23 12

pros guaranteed unique in 64-bits not much of a

CPU overhead

Monday April 23 12

cons large IDs from the get-go

Monday April 23 12

hundreds of millions of IDs generated with this

scheme no issues

Monday April 23 12

well what about ldquore-shardingrdquo

Monday April 23 12

first recourse pg_reorg

Monday April 23 12

rewrites tables in index order

Monday April 23 12

only requires brief locks for atomic table renames

Monday April 23 12

20+GB savings on some of our dbs

Monday April 23 12

Monday April 23 12

especially useful on EC2

Monday April 23 12

but sometimes you just have to reshard

Monday April 23 12

streaming replication to the rescue

Monday April 23 12

(btw repmgr is awesome)

Monday April 23 12

repmgr standby clone ltmastergt

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineArsquo shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineC shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

Monday April 23 12

PGBouncer abstracts moving DBs from the

app logic

Monday April 23 12

can do this as long as you have more logical shards than physical

ones

Monday April 23 12

beauty of schemas is that they are physically

different files

Monday April 23 12

(no IO hit when deleting no lsquoswiss cheesersquo)

Monday April 23 12

downside requires ~30 seconds of maintenance to roll out new schema

mapping

Monday April 23 12

(could be solved by having concept of ldquoread-

onlyrdquo mode for some DBs)

Monday April 23 12

not great for range-scans that would span across

shards

Monday April 23 12

latest project follow graph

Monday April 23 12

v1 simple DB table(source_id target_id

status)

Monday April 23 12

who do I followwho follows me

do I follow Xdoes X follow me

Monday April 23 12

DB was busy so we started storing parallel

version in Redis

Monday April 23 12

follow_all(300 item list)

Monday April 23 12

inconsistency

Monday April 23 12

extra logic

Monday April 23 12

so much extra logic

Monday April 23 12

exposing your support team to the idea of cache invalidation

Monday April 23 12

Monday April 23 12

redesign took a page from twitterrsquos book

Monday April 23 12

PG can handle tens of thousands of requests very light memcached

caching

Monday April 23 12

next steps

Monday April 23 12

isolating services to minimize open conns

Monday April 23 12

investigate physical hardware etc to reduce

need to re-shard

Monday April 23 12

Wrap up

Monday April 23 12

you donrsquot need to give up PGrsquos durability amp features to shard

Monday April 23 12

continue to let the DB do what the DB is great at

Monday April 23 12

ldquodonrsquot shard until you have tordquo

Monday April 23 12

(but donrsquot over-estimate how hard it will be either)

Monday April 23 12

scaled within constraints of the cloud

Monday April 23 12

PG success story

Monday April 23 12

(wersquore really excited about 92)

Monday April 23 12

thanks any qs

Monday April 23 12

next challenge unique IDs

Monday April 23 12

requirements

Monday April 23 12

1 should be time sortable without requiring

a lookup

Monday April 23 12

2 should be 64-bit

Monday April 23 12

3 low operational complexity

Monday April 23 12

surveyed the options

Monday April 23 12

ticket servers

Monday April 23 12

UUID

Monday April 23 12

twitter snowflake

Monday April 23 12

application-level IDs ala Mongo

Monday April 23 12

hey the db is already pretty good about

incrementing sequences

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

CREATE OR REPLACE FUNCTION insta5next_id(OUT result bigint) AS $$DECLARE our_epoch bigint = 1314220021721 seq_id bigint now_millis bigint shard_id int = 5BEGIN SELECT nextval(insta5table_id_seq) 1024 INTO seq_id

SELECT FLOOR(EXTRACT(EPOCH FROM clock_timestamp()) 1000) INTO now_millis result = (now_millis - our_epoch) ltlt 23 result = result | (shard_id ltlt 10) result = result | (seq_id)END$$ LANGUAGE PLPGSQL

Monday April 23 12

pulling shard ID from ID

shard_id = id ^ ((id gtgt 23) ltlt 23)timestamp = EPOCH + id gtgt 23

Monday April 23 12

pros guaranteed unique in 64-bits not much of a

CPU overhead

Monday April 23 12

cons large IDs from the get-go

Monday April 23 12

hundreds of millions of IDs generated with this

scheme no issues

Monday April 23 12

well what about ldquore-shardingrdquo

Monday April 23 12

first recourse pg_reorg

Monday April 23 12

rewrites tables in index order

Monday April 23 12

only requires brief locks for atomic table renames

Monday April 23 12

20+GB savings on some of our dbs

Monday April 23 12

Monday April 23 12

especially useful on EC2

Monday April 23 12

but sometimes you just have to reshard

Monday April 23 12

streaming replication to the rescue

Monday April 23 12

(btw repmgr is awesome)

Monday April 23 12

repmgr standby clone ltmastergt

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineArsquo shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineC shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

Monday April 23 12

PGBouncer abstracts moving DBs from the

app logic

Monday April 23 12

can do this as long as you have more logical shards than physical

ones

Monday April 23 12

beauty of schemas is that they are physically

different files

Monday April 23 12

(no IO hit when deleting no lsquoswiss cheesersquo)

Monday April 23 12

downside requires ~30 seconds of maintenance to roll out new schema

mapping

Monday April 23 12

(could be solved by having concept of ldquoread-

onlyrdquo mode for some DBs)

Monday April 23 12

not great for range-scans that would span across

shards

Monday April 23 12

latest project follow graph

Monday April 23 12

v1 simple DB table(source_id target_id

status)

Monday April 23 12

who do I followwho follows me

do I follow Xdoes X follow me

Monday April 23 12

DB was busy so we started storing parallel

version in Redis

Monday April 23 12

follow_all(300 item list)

Monday April 23 12

inconsistency

Monday April 23 12

extra logic

Monday April 23 12

so much extra logic

Monday April 23 12

exposing your support team to the idea of cache invalidation

Monday April 23 12

Monday April 23 12

redesign took a page from twitterrsquos book

Monday April 23 12

PG can handle tens of thousands of requests very light memcached

caching

Monday April 23 12

next steps

Monday April 23 12

isolating services to minimize open conns

Monday April 23 12

investigate physical hardware etc to reduce

need to re-shard

Monday April 23 12

Wrap up

Monday April 23 12

you donrsquot need to give up PGrsquos durability amp features to shard

Monday April 23 12

continue to let the DB do what the DB is great at

Monday April 23 12

ldquodonrsquot shard until you have tordquo

Monday April 23 12

(but donrsquot over-estimate how hard it will be either)

Monday April 23 12

scaled within constraints of the cloud

Monday April 23 12

PG success story

Monday April 23 12

(wersquore really excited about 92)

Monday April 23 12

thanks any qs

Monday April 23 12

requirements

Monday April 23 12

1 should be time sortable without requiring

a lookup

Monday April 23 12

2 should be 64-bit

Monday April 23 12

3 low operational complexity

Monday April 23 12

surveyed the options

Monday April 23 12

ticket servers

Monday April 23 12

UUID

Monday April 23 12

twitter snowflake

Monday April 23 12

application-level IDs ala Mongo

Monday April 23 12

hey the db is already pretty good about

incrementing sequences

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

CREATE OR REPLACE FUNCTION insta5next_id(OUT result bigint) AS $$DECLARE our_epoch bigint = 1314220021721 seq_id bigint now_millis bigint shard_id int = 5BEGIN SELECT nextval(insta5table_id_seq) 1024 INTO seq_id

SELECT FLOOR(EXTRACT(EPOCH FROM clock_timestamp()) 1000) INTO now_millis result = (now_millis - our_epoch) ltlt 23 result = result | (shard_id ltlt 10) result = result | (seq_id)END$$ LANGUAGE PLPGSQL

Monday April 23 12

pulling shard ID from ID

shard_id = id ^ ((id gtgt 23) ltlt 23)timestamp = EPOCH + id gtgt 23

Monday April 23 12

pros guaranteed unique in 64-bits not much of a

CPU overhead

Monday April 23 12

cons large IDs from the get-go

Monday April 23 12

hundreds of millions of IDs generated with this

scheme no issues

Monday April 23 12

well what about ldquore-shardingrdquo

Monday April 23 12

first recourse pg_reorg

Monday April 23 12

rewrites tables in index order

Monday April 23 12

only requires brief locks for atomic table renames

Monday April 23 12

20+GB savings on some of our dbs

Monday April 23 12

Monday April 23 12

especially useful on EC2

Monday April 23 12

but sometimes you just have to reshard

Monday April 23 12

streaming replication to the rescue

Monday April 23 12

(btw repmgr is awesome)

Monday April 23 12

repmgr standby clone ltmastergt

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineArsquo shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineC shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

Monday April 23 12

PGBouncer abstracts moving DBs from the

app logic

Monday April 23 12

can do this as long as you have more logical shards than physical

ones

Monday April 23 12

beauty of schemas is that they are physically

different files

Monday April 23 12

(no IO hit when deleting no lsquoswiss cheesersquo)

Monday April 23 12

downside requires ~30 seconds of maintenance to roll out new schema

mapping

Monday April 23 12

(could be solved by having concept of ldquoread-

onlyrdquo mode for some DBs)

Monday April 23 12

not great for range-scans that would span across

shards

Monday April 23 12

latest project follow graph

Monday April 23 12

v1 simple DB table(source_id target_id

status)

Monday April 23 12

who do I followwho follows me

do I follow Xdoes X follow me

Monday April 23 12

DB was busy so we started storing parallel

version in Redis

Monday April 23 12

follow_all(300 item list)

Monday April 23 12

inconsistency

Monday April 23 12

extra logic

Monday April 23 12

so much extra logic

Monday April 23 12

exposing your support team to the idea of cache invalidation

Monday April 23 12

Monday April 23 12

redesign took a page from twitterrsquos book

Monday April 23 12

PG can handle tens of thousands of requests very light memcached

caching

Monday April 23 12

next steps

Monday April 23 12

isolating services to minimize open conns

Monday April 23 12

investigate physical hardware etc to reduce

need to re-shard

Monday April 23 12

Wrap up

Monday April 23 12

you donrsquot need to give up PGrsquos durability amp features to shard

Monday April 23 12

continue to let the DB do what the DB is great at

Monday April 23 12

ldquodonrsquot shard until you have tordquo

Monday April 23 12

(but donrsquot over-estimate how hard it will be either)

Monday April 23 12

scaled within constraints of the cloud

Monday April 23 12

PG success story

Monday April 23 12

(wersquore really excited about 92)

Monday April 23 12

thanks any qs

Monday April 23 12

1 should be time sortable without requiring

a lookup

Monday April 23 12

2 should be 64-bit

Monday April 23 12

3 low operational complexity

Monday April 23 12

surveyed the options

Monday April 23 12

ticket servers

Monday April 23 12

UUID

Monday April 23 12

twitter snowflake

Monday April 23 12

application-level IDs ala Mongo

Monday April 23 12

hey the db is already pretty good about

incrementing sequences

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

CREATE OR REPLACE FUNCTION insta5next_id(OUT result bigint) AS $$DECLARE our_epoch bigint = 1314220021721 seq_id bigint now_millis bigint shard_id int = 5BEGIN SELECT nextval(insta5table_id_seq) 1024 INTO seq_id

SELECT FLOOR(EXTRACT(EPOCH FROM clock_timestamp()) 1000) INTO now_millis result = (now_millis - our_epoch) ltlt 23 result = result | (shard_id ltlt 10) result = result | (seq_id)END$$ LANGUAGE PLPGSQL

Monday April 23 12

pulling shard ID from ID

shard_id = id ^ ((id gtgt 23) ltlt 23)timestamp = EPOCH + id gtgt 23

Monday April 23 12

pros guaranteed unique in 64-bits not much of a

CPU overhead

Monday April 23 12

cons large IDs from the get-go

Monday April 23 12

hundreds of millions of IDs generated with this

scheme no issues

Monday April 23 12

well what about ldquore-shardingrdquo

Monday April 23 12

first recourse pg_reorg

Monday April 23 12

rewrites tables in index order

Monday April 23 12

only requires brief locks for atomic table renames

Monday April 23 12

20+GB savings on some of our dbs

Monday April 23 12

Monday April 23 12

especially useful on EC2

Monday April 23 12

but sometimes you just have to reshard

Monday April 23 12

streaming replication to the rescue

Monday April 23 12

(btw repmgr is awesome)

Monday April 23 12

repmgr standby clone ltmastergt

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineArsquo shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineC shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

Monday April 23 12

PGBouncer abstracts moving DBs from the

app logic

Monday April 23 12

can do this as long as you have more logical shards than physical

ones

Monday April 23 12

beauty of schemas is that they are physically

different files

Monday April 23 12

(no IO hit when deleting no lsquoswiss cheesersquo)

Monday April 23 12

downside requires ~30 seconds of maintenance to roll out new schema

mapping

Monday April 23 12

(could be solved by having concept of ldquoread-

onlyrdquo mode for some DBs)

Monday April 23 12

not great for range-scans that would span across

shards

Monday April 23 12

latest project follow graph

Monday April 23 12

v1 simple DB table(source_id target_id

status)

Monday April 23 12

who do I followwho follows me

do I follow Xdoes X follow me

Monday April 23 12

DB was busy so we started storing parallel

version in Redis

Monday April 23 12

follow_all(300 item list)

Monday April 23 12

inconsistency

Monday April 23 12

extra logic

Monday April 23 12

so much extra logic

Monday April 23 12

exposing your support team to the idea of cache invalidation

Monday April 23 12

Monday April 23 12

redesign took a page from twitterrsquos book

Monday April 23 12

PG can handle tens of thousands of requests very light memcached

caching

Monday April 23 12

next steps

Monday April 23 12

isolating services to minimize open conns

Monday April 23 12

investigate physical hardware etc to reduce

need to re-shard

Monday April 23 12

Wrap up

Monday April 23 12

you donrsquot need to give up PGrsquos durability amp features to shard

Monday April 23 12

continue to let the DB do what the DB is great at

Monday April 23 12

ldquodonrsquot shard until you have tordquo

Monday April 23 12

(but donrsquot over-estimate how hard it will be either)

Monday April 23 12

scaled within constraints of the cloud

Monday April 23 12

PG success story

Monday April 23 12

(wersquore really excited about 92)

Monday April 23 12

thanks any qs

Monday April 23 12

2 should be 64-bit

Monday April 23 12

3 low operational complexity

Monday April 23 12

surveyed the options

Monday April 23 12

ticket servers

Monday April 23 12

UUID

Monday April 23 12

twitter snowflake

Monday April 23 12

application-level IDs ala Mongo

Monday April 23 12

hey the db is already pretty good about

incrementing sequences

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

CREATE OR REPLACE FUNCTION insta5next_id(OUT result bigint) AS $$DECLARE our_epoch bigint = 1314220021721 seq_id bigint now_millis bigint shard_id int = 5BEGIN SELECT nextval(insta5table_id_seq) 1024 INTO seq_id

SELECT FLOOR(EXTRACT(EPOCH FROM clock_timestamp()) 1000) INTO now_millis result = (now_millis - our_epoch) ltlt 23 result = result | (shard_id ltlt 10) result = result | (seq_id)END$$ LANGUAGE PLPGSQL

Monday April 23 12

pulling shard ID from ID

shard_id = id ^ ((id gtgt 23) ltlt 23)timestamp = EPOCH + id gtgt 23

Monday April 23 12

pros guaranteed unique in 64-bits not much of a

CPU overhead

Monday April 23 12

cons large IDs from the get-go

Monday April 23 12

hundreds of millions of IDs generated with this

scheme no issues

Monday April 23 12

well what about ldquore-shardingrdquo

Monday April 23 12

first recourse pg_reorg

Monday April 23 12

rewrites tables in index order

Monday April 23 12

only requires brief locks for atomic table renames

Monday April 23 12

20+GB savings on some of our dbs

Monday April 23 12

Monday April 23 12

especially useful on EC2

Monday April 23 12

but sometimes you just have to reshard

Monday April 23 12

streaming replication to the rescue

Monday April 23 12

(btw repmgr is awesome)

Monday April 23 12

repmgr standby clone ltmastergt

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineArsquo shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineC shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

Monday April 23 12

PGBouncer abstracts moving DBs from the

app logic

Monday April 23 12

can do this as long as you have more logical shards than physical

ones

Monday April 23 12

beauty of schemas is that they are physically

different files

Monday April 23 12

(no IO hit when deleting no lsquoswiss cheesersquo)

Monday April 23 12

downside requires ~30 seconds of maintenance to roll out new schema

mapping

Monday April 23 12

(could be solved by having concept of ldquoread-

onlyrdquo mode for some DBs)

Monday April 23 12

not great for range-scans that would span across

shards

Monday April 23 12

latest project follow graph

Monday April 23 12

v1 simple DB table(source_id target_id

status)

Monday April 23 12

who do I followwho follows me

do I follow Xdoes X follow me

Monday April 23 12

DB was busy so we started storing parallel

version in Redis

Monday April 23 12

follow_all(300 item list)

Monday April 23 12

inconsistency

Monday April 23 12

extra logic

Monday April 23 12

so much extra logic

Monday April 23 12

exposing your support team to the idea of cache invalidation

Monday April 23 12

Monday April 23 12

redesign took a page from twitterrsquos book

Monday April 23 12

PG can handle tens of thousands of requests very light memcached

caching

Monday April 23 12

next steps

Monday April 23 12

isolating services to minimize open conns

Monday April 23 12

investigate physical hardware etc to reduce

need to re-shard

Monday April 23 12

Wrap up

Monday April 23 12

you donrsquot need to give up PGrsquos durability amp features to shard

Monday April 23 12

continue to let the DB do what the DB is great at

Monday April 23 12

ldquodonrsquot shard until you have tordquo

Monday April 23 12

(but donrsquot over-estimate how hard it will be either)

Monday April 23 12

scaled within constraints of the cloud

Monday April 23 12

PG success story

Monday April 23 12

(wersquore really excited about 92)

Monday April 23 12

thanks any qs

Monday April 23 12

3 low operational complexity

Monday April 23 12

surveyed the options

Monday April 23 12

ticket servers

Monday April 23 12

UUID

Monday April 23 12

twitter snowflake

Monday April 23 12

application-level IDs ala Mongo

Monday April 23 12

hey the db is already pretty good about

incrementing sequences

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

CREATE OR REPLACE FUNCTION insta5next_id(OUT result bigint) AS $$DECLARE our_epoch bigint = 1314220021721 seq_id bigint now_millis bigint shard_id int = 5BEGIN SELECT nextval(insta5table_id_seq) 1024 INTO seq_id

SELECT FLOOR(EXTRACT(EPOCH FROM clock_timestamp()) 1000) INTO now_millis result = (now_millis - our_epoch) ltlt 23 result = result | (shard_id ltlt 10) result = result | (seq_id)END$$ LANGUAGE PLPGSQL

Monday April 23 12

pulling shard ID from ID

shard_id = id ^ ((id gtgt 23) ltlt 23)timestamp = EPOCH + id gtgt 23

Monday April 23 12

pros guaranteed unique in 64-bits not much of a

CPU overhead

Monday April 23 12

cons large IDs from the get-go

Monday April 23 12

hundreds of millions of IDs generated with this

scheme no issues

Monday April 23 12

well what about ldquore-shardingrdquo

Monday April 23 12

first recourse pg_reorg

Monday April 23 12

rewrites tables in index order

Monday April 23 12

only requires brief locks for atomic table renames

Monday April 23 12

20+GB savings on some of our dbs

Monday April 23 12

Monday April 23 12

especially useful on EC2

Monday April 23 12

but sometimes you just have to reshard

Monday April 23 12

streaming replication to the rescue

Monday April 23 12

(btw repmgr is awesome)

Monday April 23 12

repmgr standby clone ltmastergt

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineArsquo shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineC shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

Monday April 23 12

PGBouncer abstracts moving DBs from the

app logic

Monday April 23 12

can do this as long as you have more logical shards than physical

ones

Monday April 23 12

beauty of schemas is that they are physically

different files

Monday April 23 12

(no IO hit when deleting no lsquoswiss cheesersquo)

Monday April 23 12

downside requires ~30 seconds of maintenance to roll out new schema

mapping

Monday April 23 12

(could be solved by having concept of ldquoread-

onlyrdquo mode for some DBs)

Monday April 23 12

not great for range-scans that would span across

shards

Monday April 23 12

latest project follow graph

Monday April 23 12

v1 simple DB table(source_id target_id

status)

Monday April 23 12

who do I followwho follows me

do I follow Xdoes X follow me

Monday April 23 12

DB was busy so we started storing parallel

version in Redis

Monday April 23 12

follow_all(300 item list)

Monday April 23 12

inconsistency

Monday April 23 12

extra logic

Monday April 23 12

so much extra logic

Monday April 23 12

exposing your support team to the idea of cache invalidation

Monday April 23 12

Monday April 23 12

redesign took a page from twitterrsquos book

Monday April 23 12

PG can handle tens of thousands of requests very light memcached

caching

Monday April 23 12

next steps

Monday April 23 12

isolating services to minimize open conns

Monday April 23 12

investigate physical hardware etc to reduce

need to re-shard

Monday April 23 12

Wrap up

Monday April 23 12

you donrsquot need to give up PGrsquos durability amp features to shard

Monday April 23 12

continue to let the DB do what the DB is great at

Monday April 23 12

ldquodonrsquot shard until you have tordquo

Monday April 23 12

(but donrsquot over-estimate how hard it will be either)

Monday April 23 12

scaled within constraints of the cloud

Monday April 23 12

PG success story

Monday April 23 12

(wersquore really excited about 92)

Monday April 23 12

thanks any qs

Monday April 23 12

surveyed the options

Monday April 23 12

ticket servers

Monday April 23 12

UUID

Monday April 23 12

twitter snowflake

Monday April 23 12

application-level IDs ala Mongo

Monday April 23 12

hey the db is already pretty good about

incrementing sequences

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

CREATE OR REPLACE FUNCTION insta5next_id(OUT result bigint) AS $$DECLARE our_epoch bigint = 1314220021721 seq_id bigint now_millis bigint shard_id int = 5BEGIN SELECT nextval(insta5table_id_seq) 1024 INTO seq_id

SELECT FLOOR(EXTRACT(EPOCH FROM clock_timestamp()) 1000) INTO now_millis result = (now_millis - our_epoch) ltlt 23 result = result | (shard_id ltlt 10) result = result | (seq_id)END$$ LANGUAGE PLPGSQL

Monday April 23 12

pulling shard ID from ID

shard_id = id ^ ((id gtgt 23) ltlt 23)timestamp = EPOCH + id gtgt 23

Monday April 23 12

pros guaranteed unique in 64-bits not much of a

CPU overhead

Monday April 23 12

cons large IDs from the get-go

Monday April 23 12

hundreds of millions of IDs generated with this

scheme no issues

Monday April 23 12

well what about ldquore-shardingrdquo

Monday April 23 12

first recourse pg_reorg

Monday April 23 12

rewrites tables in index order

Monday April 23 12

only requires brief locks for atomic table renames

Monday April 23 12

20+GB savings on some of our dbs

Monday April 23 12

Monday April 23 12

especially useful on EC2

Monday April 23 12

but sometimes you just have to reshard

Monday April 23 12

streaming replication to the rescue

Monday April 23 12

(btw repmgr is awesome)

Monday April 23 12

repmgr standby clone ltmastergt

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineArsquo shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineC shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

Monday April 23 12

PGBouncer abstracts moving DBs from the

app logic

Monday April 23 12

can do this as long as you have more logical shards than physical

ones

Monday April 23 12

beauty of schemas is that they are physically

different files

Monday April 23 12

(no IO hit when deleting no lsquoswiss cheesersquo)

Monday April 23 12

downside requires ~30 seconds of maintenance to roll out new schema

mapping

Monday April 23 12

(could be solved by having concept of ldquoread-

onlyrdquo mode for some DBs)

Monday April 23 12

not great for range-scans that would span across

shards

Monday April 23 12

latest project follow graph

Monday April 23 12

v1 simple DB table(source_id target_id

status)

Monday April 23 12

who do I followwho follows me

do I follow Xdoes X follow me

Monday April 23 12

DB was busy so we started storing parallel

version in Redis

Monday April 23 12

follow_all(300 item list)

Monday April 23 12

inconsistency

Monday April 23 12

extra logic

Monday April 23 12

so much extra logic

Monday April 23 12

exposing your support team to the idea of cache invalidation

Monday April 23 12

Monday April 23 12

redesign took a page from twitterrsquos book

Monday April 23 12

PG can handle tens of thousands of requests very light memcached

caching

Monday April 23 12

next steps

Monday April 23 12

isolating services to minimize open conns

Monday April 23 12

investigate physical hardware etc to reduce

need to re-shard

Monday April 23 12

Wrap up

Monday April 23 12

you donrsquot need to give up PGrsquos durability amp features to shard

Monday April 23 12

continue to let the DB do what the DB is great at

Monday April 23 12

ldquodonrsquot shard until you have tordquo

Monday April 23 12

(but donrsquot over-estimate how hard it will be either)

Monday April 23 12

scaled within constraints of the cloud

Monday April 23 12

PG success story

Monday April 23 12

(wersquore really excited about 92)

Monday April 23 12

thanks any qs

Monday April 23 12

ticket servers

Monday April 23 12

UUID

Monday April 23 12

twitter snowflake

Monday April 23 12

application-level IDs ala Mongo

Monday April 23 12

hey the db is already pretty good about

incrementing sequences

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

CREATE OR REPLACE FUNCTION insta5next_id(OUT result bigint) AS $$DECLARE our_epoch bigint = 1314220021721 seq_id bigint now_millis bigint shard_id int = 5BEGIN SELECT nextval(insta5table_id_seq) 1024 INTO seq_id

SELECT FLOOR(EXTRACT(EPOCH FROM clock_timestamp()) 1000) INTO now_millis result = (now_millis - our_epoch) ltlt 23 result = result | (shard_id ltlt 10) result = result | (seq_id)END$$ LANGUAGE PLPGSQL

Monday April 23 12

pulling shard ID from ID

shard_id = id ^ ((id gtgt 23) ltlt 23)timestamp = EPOCH + id gtgt 23

Monday April 23 12

pros guaranteed unique in 64-bits not much of a

CPU overhead

Monday April 23 12

cons large IDs from the get-go

Monday April 23 12

hundreds of millions of IDs generated with this

scheme no issues

Monday April 23 12

well what about ldquore-shardingrdquo

Monday April 23 12

first recourse pg_reorg

Monday April 23 12

rewrites tables in index order

Monday April 23 12

only requires brief locks for atomic table renames

Monday April 23 12

20+GB savings on some of our dbs

Monday April 23 12

Monday April 23 12

especially useful on EC2

Monday April 23 12

but sometimes you just have to reshard

Monday April 23 12

streaming replication to the rescue

Monday April 23 12

(btw repmgr is awesome)

Monday April 23 12

repmgr standby clone ltmastergt

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineArsquo shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineC shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

Monday April 23 12

PGBouncer abstracts moving DBs from the

app logic

Monday April 23 12

can do this as long as you have more logical shards than physical

ones

Monday April 23 12

beauty of schemas is that they are physically

different files

Monday April 23 12

(no IO hit when deleting no lsquoswiss cheesersquo)

Monday April 23 12

downside requires ~30 seconds of maintenance to roll out new schema

mapping

Monday April 23 12

(could be solved by having concept of ldquoread-

onlyrdquo mode for some DBs)

Monday April 23 12

not great for range-scans that would span across

shards

Monday April 23 12

latest project follow graph

Monday April 23 12

v1 simple DB table(source_id target_id

status)

Monday April 23 12

who do I followwho follows me

do I follow Xdoes X follow me

Monday April 23 12

DB was busy so we started storing parallel

version in Redis

Monday April 23 12

follow_all(300 item list)

Monday April 23 12

inconsistency

Monday April 23 12

extra logic

Monday April 23 12

so much extra logic

Monday April 23 12

exposing your support team to the idea of cache invalidation

Monday April 23 12

Monday April 23 12

redesign took a page from twitterrsquos book

Monday April 23 12

PG can handle tens of thousands of requests very light memcached

caching

Monday April 23 12

next steps

Monday April 23 12

isolating services to minimize open conns

Monday April 23 12

investigate physical hardware etc to reduce

need to re-shard

Monday April 23 12

Wrap up

Monday April 23 12

you donrsquot need to give up PGrsquos durability amp features to shard

Monday April 23 12

continue to let the DB do what the DB is great at

Monday April 23 12

ldquodonrsquot shard until you have tordquo

Monday April 23 12

(but donrsquot over-estimate how hard it will be either)

Monday April 23 12

scaled within constraints of the cloud

Monday April 23 12

PG success story

Monday April 23 12

(wersquore really excited about 92)

Monday April 23 12

thanks any qs

Monday April 23 12

UUID

Monday April 23 12

twitter snowflake

Monday April 23 12

application-level IDs ala Mongo

Monday April 23 12

hey the db is already pretty good about

incrementing sequences

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

CREATE OR REPLACE FUNCTION insta5next_id(OUT result bigint) AS $$DECLARE our_epoch bigint = 1314220021721 seq_id bigint now_millis bigint shard_id int = 5BEGIN SELECT nextval(insta5table_id_seq) 1024 INTO seq_id

SELECT FLOOR(EXTRACT(EPOCH FROM clock_timestamp()) 1000) INTO now_millis result = (now_millis - our_epoch) ltlt 23 result = result | (shard_id ltlt 10) result = result | (seq_id)END$$ LANGUAGE PLPGSQL

Monday April 23 12

pulling shard ID from ID

shard_id = id ^ ((id gtgt 23) ltlt 23)timestamp = EPOCH + id gtgt 23

Monday April 23 12

pros guaranteed unique in 64-bits not much of a

CPU overhead

Monday April 23 12

cons large IDs from the get-go

Monday April 23 12

hundreds of millions of IDs generated with this

scheme no issues

Monday April 23 12

well what about ldquore-shardingrdquo

Monday April 23 12

first recourse pg_reorg

Monday April 23 12

rewrites tables in index order

Monday April 23 12

only requires brief locks for atomic table renames

Monday April 23 12

20+GB savings on some of our dbs

Monday April 23 12

Monday April 23 12

especially useful on EC2

Monday April 23 12

but sometimes you just have to reshard

Monday April 23 12

streaming replication to the rescue

Monday April 23 12

(btw repmgr is awesome)

Monday April 23 12

repmgr standby clone ltmastergt

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineArsquo shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineC shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

Monday April 23 12

PGBouncer abstracts moving DBs from the

app logic

Monday April 23 12

can do this as long as you have more logical shards than physical

ones

Monday April 23 12

beauty of schemas is that they are physically

different files

Monday April 23 12

(no IO hit when deleting no lsquoswiss cheesersquo)

Monday April 23 12

downside requires ~30 seconds of maintenance to roll out new schema

mapping

Monday April 23 12

(could be solved by having concept of ldquoread-

onlyrdquo mode for some DBs)

Monday April 23 12

not great for range-scans that would span across

shards

Monday April 23 12

latest project follow graph

Monday April 23 12

v1 simple DB table(source_id target_id

status)

Monday April 23 12

who do I followwho follows me

do I follow Xdoes X follow me

Monday April 23 12

DB was busy so we started storing parallel

version in Redis

Monday April 23 12

follow_all(300 item list)

Monday April 23 12

inconsistency

Monday April 23 12

extra logic

Monday April 23 12

so much extra logic

Monday April 23 12

exposing your support team to the idea of cache invalidation

Monday April 23 12

Monday April 23 12

redesign took a page from twitterrsquos book

Monday April 23 12

PG can handle tens of thousands of requests very light memcached

caching

Monday April 23 12

next steps

Monday April 23 12

isolating services to minimize open conns

Monday April 23 12

investigate physical hardware etc to reduce

need to re-shard

Monday April 23 12

Wrap up

Monday April 23 12

you donrsquot need to give up PGrsquos durability amp features to shard

Monday April 23 12

continue to let the DB do what the DB is great at

Monday April 23 12

ldquodonrsquot shard until you have tordquo

Monday April 23 12

(but donrsquot over-estimate how hard it will be either)

Monday April 23 12

scaled within constraints of the cloud

Monday April 23 12

PG success story

Monday April 23 12

(wersquore really excited about 92)

Monday April 23 12

thanks any qs

Monday April 23 12

twitter snowflake

Monday April 23 12

application-level IDs ala Mongo

Monday April 23 12

hey the db is already pretty good about

incrementing sequences

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

CREATE OR REPLACE FUNCTION insta5next_id(OUT result bigint) AS $$DECLARE our_epoch bigint = 1314220021721 seq_id bigint now_millis bigint shard_id int = 5BEGIN SELECT nextval(insta5table_id_seq) 1024 INTO seq_id

SELECT FLOOR(EXTRACT(EPOCH FROM clock_timestamp()) 1000) INTO now_millis result = (now_millis - our_epoch) ltlt 23 result = result | (shard_id ltlt 10) result = result | (seq_id)END$$ LANGUAGE PLPGSQL

Monday April 23 12
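
To make the 41/13/10 layout concrete, here is a rough Python equivalent of the packing next_id() does; make_id and seq are illustrative names I've chosen, and in practice the low 10 bits come from the per-schema Postgres sequence, not application code.

import time

OUR_EPOCH = 1314220021721  # same custom epoch as insta5.next_id() above

def make_id(shard_id, seq):
    now_millis = int(time.time() * 1000)
    result = (now_millis - OUR_EPOCH) << 23   # 41 bits of time in millis
    result |= shard_id << 10                  # 13 bits for shard ID
    result |= seq % 1024                      # 10 bits of sequence
    return result

print(make_id(5, 1))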

pulling shard ID from ID

shard_id = id ^ ((id >> 23) << 23)
timestamp = EPOCH + (id >> 23)

Monday April 23 12
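
A quick Python sketch of the reverse direction, following the slide's formula; splitting the low 23 bits further into shard vs sequence (the extra >> 10) is my addition, matching the 13/10 split above.

OUR_EPOCH = 1314220021721

def split_id(id):
    low_bits = id ^ ((id >> 23) << 23)     # shard + sequence bits, per the slide
    shard_id = low_bits >> 10              # drop the 10 sequence bits
    timestamp_millis = OUR_EPOCH + (id >> 23)
    return shard_id, timestamp_millis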

pros: guaranteed unique in 64 bits, not much of a CPU overhead

Monday April 23 12

cons: large IDs from the get-go

Monday April 23 12

hundreds of millions of IDs generated with this scheme, no issues

Monday April 23 12

well, what about "re-sharding"?

Monday April 23 12

first recourse: pg_reorg

Monday April 23 12

rewrites tables in index order

Monday April 23 12

only requires brief locks for atomic table renames

Monday April 23 12

20+GB savings on some of our dbs

Monday April 23 12

especially useful on EC2

Monday April 23 12

but sometimes you just have to reshard

Monday April 23 12

streaming replication to the rescue

Monday April 23 12

(btw repmgr is awesome)

Monday April 23 12

repmgr standby clone <master>

Monday April 23 12

machineA: shard0.photos_by_user | shard1.photos_by_user | shard2.photos_by_user | shard3.photos_by_user

machineA': shard0.photos_by_user | shard1.photos_by_user | shard2.photos_by_user | shard3.photos_by_user

Monday April 23 12

machineA: shard0.photos_by_user | shard1.photos_by_user | shard2.photos_by_user | shard3.photos_by_user

machineC: shard0.photos_by_user | shard1.photos_by_user | shard2.photos_by_user | shard3.photos_by_user

Monday April 23 12

PGBouncer abstracts moving DBs from the app logic

Monday April 23 12

can do this as long as you have more logical shards than physical ones

Monday April 23 12
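
The indirection can be pictured as a tiny mapping table. In the talk this lives in PGBouncer's config rather than application code, so the dict, hostnames, and DSN below are purely illustrative.

# Purely illustrative: moving shard2/shard3 off machineA is just an edit to
# this mapping (done in PGBouncer config in practice, not in the app).
SHARD_TO_HOST = {
    "shard0": "machineA",
    "shard1": "machineA",
    "shard2": "machineC",   # served from the promoted replica after the split
    "shard3": "machineC",
}

def dsn_for(schema):
    # dbname is a made-up placeholder
    return "host=%s dbname=shards" % SHARD_TO_HOST[schema]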

beauty of schemas is that they are physically different files

Monday April 23 12
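
A sketch of what schema-per-shard addressing looks like from the app side, assuming the usual user_id-modulo-logical-shard mapping; the shard count, table/column names, and psycopg2-style cursor are assumptions, not taken from the talk.

# conn is assumed to be a psycopg2 connection already pointed (via PGBouncer)
# at whichever machine hosts this schema; table/column names are guesses.
N_LOGICAL_SHARDS = 4   # the slides show 4 schemas per box; the real count is larger

def photos_for_user(conn, user_id, limit=20):
    schema = "shard%d" % (user_id % N_LOGICAL_SHARDS)
    cur = conn.cursor()
    cur.execute(
        "SELECT id FROM %s.photos_by_user WHERE user_id = %%s "
        "ORDER BY id DESC LIMIT %%s" % schema,
        (user_id, limit),
    )
    return [row[0] for row in cur.fetchall()]

Because the IDs above carry the creation time in their high bits, ORDER BY id DESC is also reverse-chronological.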

(no I/O hit when deleting, no 'swiss cheese')

Monday April 23 12

downside: requires ~30 seconds of maintenance to roll out new schema mapping

Monday April 23 12

(could be solved by having concept of "read-only" mode for some DBs)

Monday April 23 12

not great for range-scans that would span across shards

Monday April 23 12
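
Roughly why: anything not keyed by the shard key degenerates into a scatter-gather like this (function and table names here are illustrative).

import heapq

def recent_photos_everywhere(conns_by_schema, limit=20):
    partial = []
    for schema, conn in conns_by_schema.items():
        cur = conn.cursor()
        cur.execute("SELECT id FROM %s.photos_by_user ORDER BY id DESC LIMIT %d"
                    % (schema, limit))
        partial.extend(row[0] for row in cur.fetchall())
    # merge the per-shard results and keep the global top N
    return heapq.nlargest(limit, partial)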

latest project: follow graph

Monday April 23 12

v1: simple DB table (source_id, target_id, status)

Monday April 23 12

who do I follow? who follows me?

do I follow X? does X follow me?

Monday April 23 12
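
A sketch of that v1 table and the four lookups it has to answer; exact column types, index names, and status values are guesses rather than the real schema.

# Column types, index names, and status values are assumptions.
FOLLOWS_DDL = """
CREATE TABLE follows (
    source_id bigint NOT NULL,    -- the follower
    target_id bigint NOT NULL,    -- the person being followed
    status    smallint NOT NULL   -- e.g. active / blocked / pending
);
CREATE INDEX follows_source_idx ON follows (source_id);
CREATE INDEX follows_target_idx ON follows (target_id);
"""

WHO_DO_I_FOLLOW  = "SELECT target_id FROM follows WHERE source_id = %s"
WHO_FOLLOWS_ME   = "SELECT source_id FROM follows WHERE target_id = %s"
# same query, parameters swapped: (me, X) for "do I follow X",
# (X, me) for "does X follow me"
FOLLOW_EDGE_EXISTS = "SELECT 1 FROM follows WHERE source_id = %s AND target_id = %s"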

DB was busy, so we started storing a parallel version in Redis

Monday April 23 12

follow_all(300-item list)

Monday April 23 12
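
The parallel Redis copy might have looked roughly like this (key names and the redis-py usage are assumptions); the fan-out write in follow_all is where the drift the next slides complain about comes from.

import redis

r = redis.StrictRedis()

def record_follow(source_id, target_id):
    r.sadd("following:%d" % source_id, target_id)
    r.sadd("followers:%d" % target_id, source_id)

def follow_all(source_id, target_ids):      # e.g. a ~300-item list
    pipe = r.pipeline()
    for target_id in target_ids:
        pipe.sadd("following:%d" % source_id, target_id)
        pipe.sadd("followers:%d" % target_id, source_id)
    pipe.execute()   # a partial failure here leaves Redis and Postgres disagreeing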

inconsistency

Monday April 23 12

extra logic

Monday April 23 12

so much extra logic

Monday April 23 12

exposing your support team to the idea of cache invalidation

Monday April 23 12

redesign took a page from twitter's book

Monday April 23 12

PG can handle tens of thousands of requests, very light memcached caching

Monday April 23 12
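
"Very light memcached caching" in front of a PG lookup can be as simple as a read-through cache with delete-on-write; the key format, TTL, and python-memcached client below are assumptions, not the talk's implementation.

import memcache

mc = memcache.Client(["127.0.0.1:11211"])

def follows(conn, source_id, target_id):
    key = "follows:%d:%d" % (source_id, target_id)
    cached = mc.get(key)
    if cached is not None:
        return cached == "1"
    cur = conn.cursor()
    cur.execute("SELECT 1 FROM follows WHERE source_id = %s AND target_id = %s",
                (source_id, target_id))
    exists = cur.fetchone() is not None
    mc.set(key, "1" if exists else "0", time=600)   # short TTL, made up
    return exists

def unfollow(conn, source_id, target_id):
    # ... delete the row in Postgres first, then drop the cached answer
    mc.delete("follows:%d:%d" % (source_id, target_id))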

next steps

Monday April 23 12

isolating services to minimize open conns

Monday April 23 12

investigate physical hardware etc. to reduce need to re-shard

Monday April 23 12

Wrap up

Monday April 23 12

you don't need to give up PG's durability & features to shard

Monday April 23 12

continue to let the DB do what the DB is great at

Monday April 23 12

"don't shard until you have to"

Monday April 23 12

(but don't over-estimate how hard it will be either)

Monday April 23 12

scaled within constraints of the cloud

Monday April 23 12

PG success story

Monday April 23 12

(we're really excited about 9.2)

Monday April 23 12

thanks! any qs?

Monday April 23 12


who do I followwho follows me

do I follow Xdoes X follow me

Monday April 23 12

DB was busy so we started storing parallel

version in Redis

Monday April 23 12

follow_all(300 item list)

Monday April 23 12

inconsistency

Monday April 23 12

extra logic

Monday April 23 12

so much extra logic

Monday April 23 12

exposing your support team to the idea of cache invalidation

Monday April 23 12

Monday April 23 12

redesign took a page from twitterrsquos book

Monday April 23 12

PG can handle tens of thousands of requests very light memcached

caching

Monday April 23 12

next steps

Monday April 23 12

isolating services to minimize open conns

Monday April 23 12

investigate physical hardware etc to reduce

need to re-shard

Monday April 23 12

Wrap up

Monday April 23 12

you donrsquot need to give up PGrsquos durability amp features to shard

Monday April 23 12

continue to let the DB do what the DB is great at

Monday April 23 12

ldquodonrsquot shard until you have tordquo

Monday April 23 12

(but donrsquot over-estimate how hard it will be either)

Monday April 23 12

scaled within constraints of the cloud

Monday April 23 12

PG success story

Monday April 23 12

(wersquore really excited about 92)

Monday April 23 12

thanks any qs

Monday April 23 12

especially useful on EC2

Monday April 23 12

but sometimes you just have to reshard

Monday April 23 12

streaming replication to the rescue

Monday April 23 12

(btw repmgr is awesome)

Monday April 23 12

repmgr standby clone ltmastergt

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineArsquo shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineC shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

Monday April 23 12

PGBouncer abstracts moving DBs from the

app logic

Monday April 23 12

can do this as long as you have more logical shards than physical

ones

Monday April 23 12

beauty of schemas is that they are physically

different files

Monday April 23 12

(no IO hit when deleting no lsquoswiss cheesersquo)

Monday April 23 12

downside requires ~30 seconds of maintenance to roll out new schema

mapping

Monday April 23 12

(could be solved by having concept of ldquoread-

onlyrdquo mode for some DBs)

Monday April 23 12

not great for range-scans that would span across

shards

Monday April 23 12

latest project follow graph

Monday April 23 12

v1 simple DB table(source_id target_id

status)

Monday April 23 12

who do I followwho follows me

do I follow Xdoes X follow me

Monday April 23 12

DB was busy so we started storing parallel

version in Redis

Monday April 23 12

follow_all(300 item list)

Monday April 23 12

inconsistency

Monday April 23 12

extra logic

Monday April 23 12

so much extra logic

Monday April 23 12

exposing your support team to the idea of cache invalidation

Monday April 23 12

Monday April 23 12

redesign took a page from twitterrsquos book

Monday April 23 12

PG can handle tens of thousands of requests very light memcached

caching

Monday April 23 12

next steps

Monday April 23 12

isolating services to minimize open conns

Monday April 23 12

investigate physical hardware etc to reduce

need to re-shard

Monday April 23 12

Wrap up

Monday April 23 12

you donrsquot need to give up PGrsquos durability amp features to shard

Monday April 23 12

continue to let the DB do what the DB is great at

Monday April 23 12

ldquodonrsquot shard until you have tordquo

Monday April 23 12

(but donrsquot over-estimate how hard it will be either)

Monday April 23 12

scaled within constraints of the cloud

Monday April 23 12

PG success story

Monday April 23 12

(wersquore really excited about 92)

Monday April 23 12

thanks any qs

Monday April 23 12

but sometimes you just have to reshard

Monday April 23 12

streaming replication to the rescue

Monday April 23 12

(btw repmgr is awesome)

Monday April 23 12

repmgr standby clone ltmastergt

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineArsquo shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineC shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

Monday April 23 12

PGBouncer abstracts moving DBs from the

app logic

Monday April 23 12

can do this as long as you have more logical shards than physical

ones

Monday April 23 12

beauty of schemas is that they are physically

different files

Monday April 23 12

(no IO hit when deleting no lsquoswiss cheesersquo)

Monday April 23 12

downside requires ~30 seconds of maintenance to roll out new schema

mapping

Monday April 23 12

(could be solved by having concept of ldquoread-

onlyrdquo mode for some DBs)

Monday April 23 12

not great for range-scans that would span across

shards

Monday April 23 12

latest project follow graph

Monday April 23 12

v1 simple DB table(source_id target_id

status)

Monday April 23 12

who do I followwho follows me

do I follow Xdoes X follow me

Monday April 23 12

DB was busy so we started storing parallel

version in Redis

Monday April 23 12

follow_all(300 item list)

Monday April 23 12

inconsistency

Monday April 23 12

extra logic

Monday April 23 12

so much extra logic

Monday April 23 12

exposing your support team to the idea of cache invalidation

Monday April 23 12

Monday April 23 12

redesign took a page from twitterrsquos book

Monday April 23 12

PG can handle tens of thousands of requests very light memcached

caching

Monday April 23 12

next steps

Monday April 23 12

isolating services to minimize open conns

Monday April 23 12

investigate physical hardware etc to reduce

need to re-shard

Monday April 23 12

Wrap up

Monday April 23 12

you donrsquot need to give up PGrsquos durability amp features to shard

Monday April 23 12

continue to let the DB do what the DB is great at

Monday April 23 12

ldquodonrsquot shard until you have tordquo

Monday April 23 12

(but donrsquot over-estimate how hard it will be either)

Monday April 23 12

scaled within constraints of the cloud

Monday April 23 12

PG success story

Monday April 23 12

(wersquore really excited about 92)

Monday April 23 12

thanks any qs

Monday April 23 12

streaming replication to the rescue

Monday April 23 12

(btw repmgr is awesome)

Monday April 23 12

repmgr standby clone ltmastergt

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineArsquo shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineC shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

Monday April 23 12

PGBouncer abstracts moving DBs from the

app logic

Monday April 23 12

can do this as long as you have more logical shards than physical

ones

Monday April 23 12

beauty of schemas is that they are physically

different files

Monday April 23 12

(no IO hit when deleting no lsquoswiss cheesersquo)

Monday April 23 12

downside requires ~30 seconds of maintenance to roll out new schema

mapping

Monday April 23 12

(could be solved by having concept of ldquoread-

onlyrdquo mode for some DBs)

Monday April 23 12

not great for range-scans that would span across

shards

Monday April 23 12

latest project follow graph

Monday April 23 12

v1 simple DB table(source_id target_id

status)

Monday April 23 12

who do I followwho follows me

do I follow Xdoes X follow me

Monday April 23 12

DB was busy so we started storing parallel

version in Redis

Monday April 23 12

follow_all(300 item list)

Monday April 23 12

inconsistency

Monday April 23 12

extra logic

Monday April 23 12

so much extra logic

Monday April 23 12

exposing your support team to the idea of cache invalidation

Monday April 23 12

Monday April 23 12

redesign took a page from twitterrsquos book

Monday April 23 12

PG can handle tens of thousands of requests very light memcached

caching

Monday April 23 12

next steps

Monday April 23 12

isolating services to minimize open conns

Monday April 23 12

investigate physical hardware etc to reduce

need to re-shard

Monday April 23 12

Wrap up

Monday April 23 12

you donrsquot need to give up PGrsquos durability amp features to shard

Monday April 23 12

continue to let the DB do what the DB is great at

Monday April 23 12

ldquodonrsquot shard until you have tordquo

Monday April 23 12

(but donrsquot over-estimate how hard it will be either)

Monday April 23 12

scaled within constraints of the cloud

Monday April 23 12

PG success story

Monday April 23 12

(wersquore really excited about 92)

Monday April 23 12

thanks any qs

Monday April 23 12

(btw repmgr is awesome)

Monday April 23 12

repmgr standby clone ltmastergt

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineArsquo shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineC shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

Monday April 23 12

PGBouncer abstracts moving DBs from the

app logic

Monday April 23 12

can do this as long as you have more logical shards than physical

ones

Monday April 23 12

beauty of schemas is that they are physically

different files

Monday April 23 12

(no IO hit when deleting no lsquoswiss cheesersquo)

Monday April 23 12

downside requires ~30 seconds of maintenance to roll out new schema

mapping

Monday April 23 12

(could be solved by having concept of ldquoread-

onlyrdquo mode for some DBs)

Monday April 23 12

not great for range-scans that would span across

shards

Monday April 23 12

latest project follow graph

Monday April 23 12

v1 simple DB table(source_id target_id

status)

Monday April 23 12

who do I followwho follows me

do I follow Xdoes X follow me

Monday April 23 12

DB was busy so we started storing parallel

version in Redis

Monday April 23 12

follow_all(300 item list)

Monday April 23 12

inconsistency

Monday April 23 12

extra logic

Monday April 23 12

so much extra logic

Monday April 23 12

exposing your support team to the idea of cache invalidation

Monday April 23 12

Monday April 23 12

redesign took a page from twitterrsquos book

Monday April 23 12

PG can handle tens of thousands of requests very light memcached

caching

Monday April 23 12

next steps

Monday April 23 12

isolating services to minimize open conns

Monday April 23 12

investigate physical hardware etc to reduce

need to re-shard

Monday April 23 12

Wrap up

Monday April 23 12

you donrsquot need to give up PGrsquos durability amp features to shard

Monday April 23 12

continue to let the DB do what the DB is great at

Monday April 23 12

ldquodonrsquot shard until you have tordquo

Monday April 23 12

(but donrsquot over-estimate how hard it will be either)

Monday April 23 12

scaled within constraints of the cloud

Monday April 23 12

PG success story

Monday April 23 12

(wersquore really excited about 92)

Monday April 23 12

thanks any qs

Monday April 23 12

repmgr standby clone ltmastergt

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineArsquo shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineC shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

Monday April 23 12

PGBouncer abstracts moving DBs from the

app logic

Monday April 23 12

can do this as long as you have more logical shards than physical

ones

Monday April 23 12

beauty of schemas is that they are physically

different files

Monday April 23 12

(no IO hit when deleting no lsquoswiss cheesersquo)

Monday April 23 12

downside requires ~30 seconds of maintenance to roll out new schema

mapping

Monday April 23 12

(could be solved by having concept of ldquoread-

onlyrdquo mode for some DBs)

Monday April 23 12

not great for range-scans that would span across

shards

Monday April 23 12

latest project follow graph

Monday April 23 12

v1 simple DB table(source_id target_id

status)

Monday April 23 12

who do I followwho follows me

do I follow Xdoes X follow me

Monday April 23 12

DB was busy so we started storing parallel

version in Redis

Monday April 23 12

follow_all(300 item list)

Monday April 23 12

inconsistency

Monday April 23 12

extra logic

Monday April 23 12

so much extra logic

Monday April 23 12

exposing your support team to the idea of cache invalidation

Monday April 23 12

Monday April 23 12

redesign took a page from twitterrsquos book

Monday April 23 12

PG can handle tens of thousands of requests very light memcached

caching

Monday April 23 12

next steps

Monday April 23 12

isolating services to minimize open conns

Monday April 23 12

investigate physical hardware etc to reduce

need to re-shard

Monday April 23 12

Wrap up

Monday April 23 12

you donrsquot need to give up PGrsquos durability amp features to shard

Monday April 23 12

continue to let the DB do what the DB is great at

Monday April 23 12

ldquodonrsquot shard until you have tordquo

Monday April 23 12

(but donrsquot over-estimate how hard it will be either)

Monday April 23 12

scaled within constraints of the cloud

Monday April 23 12

PG success story

Monday April 23 12

(wersquore really excited about 92)

Monday April 23 12

thanks any qs

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineArsquo shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineC shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

Monday April 23 12

PGBouncer abstracts moving DBs from the

app logic

Monday April 23 12

can do this as long as you have more logical shards than physical

ones

Monday April 23 12

beauty of schemas is that they are physically

different files

Monday April 23 12

(no IO hit when deleting no lsquoswiss cheesersquo)

Monday April 23 12

downside requires ~30 seconds of maintenance to roll out new schema

mapping

Monday April 23 12

(could be solved by having concept of ldquoread-

onlyrdquo mode for some DBs)

Monday April 23 12

not great for range-scans that would span across

shards

Monday April 23 12

latest project follow graph

Monday April 23 12

v1 simple DB table(source_id target_id

status)

Monday April 23 12

who do I followwho follows me

do I follow Xdoes X follow me

Monday April 23 12

DB was busy so we started storing parallel

version in Redis

Monday April 23 12

follow_all(300 item list)

Monday April 23 12

inconsistency

Monday April 23 12

extra logic

Monday April 23 12

so much extra logic

Monday April 23 12

exposing your support team to the idea of cache invalidation

Monday April 23 12

Monday April 23 12

redesign took a page from twitterrsquos book

Monday April 23 12

PG can handle tens of thousands of requests very light memcached

caching

Monday April 23 12

next steps

Monday April 23 12

isolating services to minimize open conns

Monday April 23 12

investigate physical hardware etc to reduce

need to re-shard

Monday April 23 12

Wrap up

Monday April 23 12

you donrsquot need to give up PGrsquos durability amp features to shard

Monday April 23 12

continue to let the DB do what the DB is great at

Monday April 23 12

ldquodonrsquot shard until you have tordquo

Monday April 23 12

(but donrsquot over-estimate how hard it will be either)

Monday April 23 12

scaled within constraints of the cloud

Monday April 23 12

PG success story

Monday April 23 12

(wersquore really excited about 92)

Monday April 23 12

thanks any qs

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineC shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

Monday April 23 12

PGBouncer abstracts moving DBs from the

app logic

Monday April 23 12

can do this as long as you have more logical shards than physical

ones

Monday April 23 12

beauty of schemas is that they are physically

different files

Monday April 23 12

(no IO hit when deleting no lsquoswiss cheesersquo)

Monday April 23 12

downside requires ~30 seconds of maintenance to roll out new schema

mapping

Monday April 23 12

(could be solved by having concept of ldquoread-

onlyrdquo mode for some DBs)

Monday April 23 12

not great for range-scans that would span across

shards

Monday April 23 12

latest project follow graph

Monday April 23 12

v1 simple DB table(source_id target_id

status)

Monday April 23 12

who do I followwho follows me

do I follow Xdoes X follow me

Monday April 23 12

DB was busy so we started storing parallel

version in Redis

Monday April 23 12

follow_all(300 item list)

Monday April 23 12

inconsistency

Monday April 23 12

extra logic

Monday April 23 12

so much extra logic

Monday April 23 12

exposing your support team to the idea of cache invalidation

Monday April 23 12

Monday April 23 12

redesign took a page from twitterrsquos book

Monday April 23 12

PG can handle tens of thousands of requests very light memcached

caching

Monday April 23 12

next steps

Monday April 23 12

isolating services to minimize open conns

Monday April 23 12

investigate physical hardware etc to reduce

need to re-shard

Monday April 23 12

Wrap up

Monday April 23 12

you donrsquot need to give up PGrsquos durability amp features to shard

Monday April 23 12

continue to let the DB do what the DB is great at

Monday April 23 12

ldquodonrsquot shard until you have tordquo

Monday April 23 12

(but donrsquot over-estimate how hard it will be either)

Monday April 23 12

scaled within constraints of the cloud

Monday April 23 12

PG success story

Monday April 23 12

(wersquore really excited about 92)

Monday April 23 12

thanks any qs

Monday April 23 12

PGBouncer abstracts moving DBs from the

app logic

Monday April 23 12

can do this as long as you have more logical shards than physical

ones

Monday April 23 12

beauty of schemas is that they are physically

different files

Monday April 23 12

(no IO hit when deleting no lsquoswiss cheesersquo)

Monday April 23 12

downside requires ~30 seconds of maintenance to roll out new schema

mapping

Monday April 23 12

(could be solved by having concept of ldquoread-

onlyrdquo mode for some DBs)

Monday April 23 12

not great for range-scans that would span across

shards

Monday April 23 12

latest project follow graph

Monday April 23 12

v1 simple DB table(source_id target_id

status)

Monday April 23 12

who do I followwho follows me

do I follow Xdoes X follow me

Monday April 23 12

DB was busy so we started storing parallel

version in Redis

Monday April 23 12

follow_all(300 item list)

Monday April 23 12

inconsistency

Monday April 23 12

extra logic

Monday April 23 12

so much extra logic

Monday April 23 12

exposing your support team to the idea of cache invalidation

Monday April 23 12

Monday April 23 12

redesign took a page from twitterrsquos book

Monday April 23 12

PG can handle tens of thousands of requests very light memcached

caching

Monday April 23 12

next steps

Monday April 23 12

isolating services to minimize open conns

Monday April 23 12

investigate physical hardware etc to reduce

need to re-shard

Monday April 23 12

Wrap up

Monday April 23 12

you donrsquot need to give up PGrsquos durability amp features to shard

Monday April 23 12

continue to let the DB do what the DB is great at

Monday April 23 12

ldquodonrsquot shard until you have tordquo

Monday April 23 12

(but donrsquot over-estimate how hard it will be either)

Monday April 23 12

scaled within constraints of the cloud

Monday April 23 12

PG success story

Monday April 23 12

(wersquore really excited about 92)

Monday April 23 12

thanks any qs

Monday April 23 12

can do this as long as you have more logical shards than physical

ones

Monday April 23 12

beauty of schemas is that they are physically

different files

Monday April 23 12

(no IO hit when deleting no lsquoswiss cheesersquo)

Monday April 23 12

downside requires ~30 seconds of maintenance to roll out new schema

mapping

Monday April 23 12

(could be solved by having concept of ldquoread-

onlyrdquo mode for some DBs)

Monday April 23 12

not great for range-scans that would span across

shards

Monday April 23 12

latest project follow graph

Monday April 23 12

v1 simple DB table(source_id target_id

status)

Monday April 23 12

who do I followwho follows me

do I follow Xdoes X follow me

Monday April 23 12

DB was busy so we started storing parallel

version in Redis

Monday April 23 12

follow_all(300 item list)

Monday April 23 12

inconsistency

Monday April 23 12

extra logic

Monday April 23 12

so much extra logic

Monday April 23 12

exposing your support team to the idea of cache invalidation

Monday April 23 12

Monday April 23 12

redesign took a page from twitterrsquos book

Monday April 23 12

PG can handle tens of thousands of requests very light memcached

caching

Monday April 23 12

next steps

Monday April 23 12

isolating services to minimize open conns

Monday April 23 12

investigate physical hardware etc to reduce

need to re-shard

Monday April 23 12

Wrap up

Monday April 23 12

you donrsquot need to give up PGrsquos durability amp features to shard

Monday April 23 12

continue to let the DB do what the DB is great at

Monday April 23 12

ldquodonrsquot shard until you have tordquo

Monday April 23 12

(but donrsquot over-estimate how hard it will be either)

Monday April 23 12

scaled within constraints of the cloud

Monday April 23 12

PG success story

Monday April 23 12

(wersquore really excited about 92)

Monday April 23 12

thanks any qs

Monday April 23 12

beauty of schemas is that they are physically

different files

Monday April 23 12

(no IO hit when deleting no lsquoswiss cheesersquo)

Monday April 23 12

downside requires ~30 seconds of maintenance to roll out new schema

mapping

Monday April 23 12

(could be solved by having concept of ldquoread-

onlyrdquo mode for some DBs)

Monday April 23 12

not great for range-scans that would span across

shards

Monday April 23 12

latest project follow graph

Monday April 23 12

v1 simple DB table(source_id target_id

status)

Monday April 23 12

who do I followwho follows me

do I follow Xdoes X follow me

Monday April 23 12

DB was busy so we started storing parallel

version in Redis

Monday April 23 12

follow_all(300 item list)

Monday April 23 12

inconsistency

Monday April 23 12

extra logic

Monday April 23 12

so much extra logic

Monday April 23 12

exposing your support team to the idea of cache invalidation

Monday April 23 12

Monday April 23 12

redesign took a page from twitterrsquos book

Monday April 23 12

PG can handle tens of thousands of requests very light memcached

caching

Monday April 23 12

next steps

Monday April 23 12

isolating services to minimize open conns

Monday April 23 12

investigate physical hardware etc to reduce

need to re-shard

Monday April 23 12

Wrap up

Monday April 23 12

you donrsquot need to give up PGrsquos durability amp features to shard

Monday April 23 12

continue to let the DB do what the DB is great at

Monday April 23 12

ldquodonrsquot shard until you have tordquo

Monday April 23 12

(but donrsquot over-estimate how hard it will be either)

Monday April 23 12

scaled within constraints of the cloud

Monday April 23 12

PG success story

Monday April 23 12

(wersquore really excited about 92)

Monday April 23 12

thanks any qs

Monday April 23 12

(no IO hit when deleting no lsquoswiss cheesersquo)

Monday April 23 12

downside requires ~30 seconds of maintenance to roll out new schema

mapping

Monday April 23 12

(could be solved by having concept of ldquoread-

onlyrdquo mode for some DBs)

Monday April 23 12

not great for range-scans that would span across

shards

Monday April 23 12

latest project follow graph

Monday April 23 12

v1 simple DB table(source_id target_id

status)

Monday April 23 12

who do I followwho follows me

do I follow Xdoes X follow me

Monday April 23 12

DB was busy so we started storing parallel

version in Redis

Monday April 23 12

follow_all(300 item list)

Monday April 23 12

inconsistency

Monday April 23 12

extra logic

Monday April 23 12

so much extra logic

Monday April 23 12

exposing your support team to the idea of cache invalidation

Monday April 23 12

Monday April 23 12

redesign took a page from twitterrsquos book

Monday April 23 12

PG can handle tens of thousands of requests very light memcached

caching

Monday April 23 12

next steps

Monday April 23 12

isolating services to minimize open conns

Monday April 23 12

investigate physical hardware etc to reduce

need to re-shard

Monday April 23 12

Wrap up

Monday April 23 12

you donrsquot need to give up PGrsquos durability amp features to shard

Monday April 23 12

continue to let the DB do what the DB is great at

Monday April 23 12

ldquodonrsquot shard until you have tordquo

Monday April 23 12

(but donrsquot over-estimate how hard it will be either)

Monday April 23 12

scaled within constraints of the cloud

Monday April 23 12

PG success story

Monday April 23 12

(wersquore really excited about 92)

Monday April 23 12

thanks any qs

Monday April 23 12

downside requires ~30 seconds of maintenance to roll out new schema

mapping

Monday April 23 12

(could be solved by having concept of ldquoread-

onlyrdquo mode for some DBs)

Monday April 23 12

not great for range-scans that would span across

shards

Monday April 23 12

latest project follow graph

Monday April 23 12

v1 simple DB table(source_id target_id

status)

Monday April 23 12

who do I followwho follows me

do I follow Xdoes X follow me

Monday April 23 12

DB was busy so we started storing parallel

version in Redis

Monday April 23 12

follow_all(300 item list)

Monday April 23 12

inconsistency

Monday April 23 12

extra logic

Monday April 23 12

so much extra logic

Monday April 23 12

exposing your support team to the idea of cache invalidation

Monday April 23 12

Monday April 23 12

redesign took a page from twitterrsquos book

Monday April 23 12

PG can handle tens of thousands of requests very light memcached

caching

Monday April 23 12

next steps

Monday April 23 12

isolating services to minimize open conns

Monday April 23 12

investigate physical hardware etc to reduce

need to re-shard

Monday April 23 12

Wrap up

Monday April 23 12

you donrsquot need to give up PGrsquos durability amp features to shard

Monday April 23 12

continue to let the DB do what the DB is great at

Monday April 23 12

ldquodonrsquot shard until you have tordquo

Monday April 23 12

(but donrsquot over-estimate how hard it will be either)

Monday April 23 12

scaled within constraints of the cloud

Monday April 23 12

PG success story

Monday April 23 12

(wersquore really excited about 92)

Monday April 23 12

thanks any qs

Monday April 23 12

(could be solved by having concept of ldquoread-

onlyrdquo mode for some DBs)

Monday April 23 12

not great for range-scans that would span across

shards

Monday April 23 12

latest project follow graph

Monday April 23 12

v1 simple DB table(source_id target_id

status)

Monday April 23 12

who do I followwho follows me

do I follow Xdoes X follow me

Monday April 23 12

DB was busy so we started storing parallel

version in Redis

Monday April 23 12

follow_all(300 item list)

Monday April 23 12

inconsistency

Monday April 23 12

extra logic

Monday April 23 12

so much extra logic

Monday April 23 12

exposing your support team to the idea of cache invalidation

Monday April 23 12

Monday April 23 12

redesign took a page from twitterrsquos book

Monday April 23 12

PG can handle tens of thousands of requests very light memcached

caching

Monday April 23 12

next steps

Monday April 23 12

isolating services to minimize open conns

Monday April 23 12

investigate physical hardware etc to reduce

need to re-shard

Monday April 23 12

Wrap up

Monday April 23 12

you donrsquot need to give up PGrsquos durability amp features to shard

Monday April 23 12

continue to let the DB do what the DB is great at

Monday April 23 12

ldquodonrsquot shard until you have tordquo

Monday April 23 12

(but donrsquot over-estimate how hard it will be either)

Monday April 23 12

scaled within constraints of the cloud

Monday April 23 12

PG success story

Monday April 23 12

(wersquore really excited about 92)

Monday April 23 12

thanks any qs

Monday April 23 12

not great for range-scans that would span across

shards

Monday April 23 12

latest project follow graph

Monday April 23 12

v1 simple DB table(source_id target_id

status)

Monday April 23 12

who do I followwho follows me

do I follow Xdoes X follow me

Monday April 23 12

DB was busy so we started storing parallel

version in Redis

Monday April 23 12

follow_all(300 item list)

Monday April 23 12

inconsistency

Monday April 23 12

extra logic

Monday April 23 12

so much extra logic

Monday April 23 12

exposing your support team to the idea of cache invalidation

Monday April 23 12

Monday April 23 12

redesign took a page from twitterrsquos book

Monday April 23 12

PG can handle tens of thousands of requests very light memcached

caching

Monday April 23 12

next steps

Monday April 23 12

isolating services to minimize open conns

Monday April 23 12

investigate physical hardware etc to reduce

need to re-shard

Monday April 23 12

Wrap up

Monday April 23 12

you donrsquot need to give up PGrsquos durability amp features to shard

Monday April 23 12

continue to let the DB do what the DB is great at

Monday April 23 12

ldquodonrsquot shard until you have tordquo

Monday April 23 12

(but donrsquot over-estimate how hard it will be either)

Monday April 23 12

scaled within constraints of the cloud

Monday April 23 12

PG success story

Monday April 23 12

(wersquore really excited about 92)

Monday April 23 12

thanks any qs

Monday April 23 12

latest project follow graph

Monday April 23 12

v1 simple DB table(source_id target_id

status)

Monday April 23 12

who do I followwho follows me

do I follow Xdoes X follow me

Monday April 23 12

DB was busy so we started storing parallel

version in Redis

Monday April 23 12

follow_all(300 item list)

Monday April 23 12

inconsistency

Monday April 23 12

extra logic

Monday April 23 12

so much extra logic

Monday April 23 12

exposing your support team to the idea of cache invalidation

Monday April 23 12

Monday April 23 12

redesign took a page from twitterrsquos book

Monday April 23 12

PG can handle tens of thousands of requests very light memcached

caching

Monday April 23 12

next steps

Monday April 23 12

isolating services to minimize open conns

Monday April 23 12

investigate physical hardware etc to reduce

need to re-shard

Monday April 23 12

Wrap up

Monday April 23 12

you donrsquot need to give up PGrsquos durability amp features to shard

Monday April 23 12

continue to let the DB do what the DB is great at

Monday April 23 12

ldquodonrsquot shard until you have tordquo

Monday April 23 12

(but donrsquot over-estimate how hard it will be either)

Monday April 23 12

scaled within constraints of the cloud

Monday April 23 12

PG success story

Monday April 23 12

(wersquore really excited about 92)

Monday April 23 12

thanks any qs

Monday April 23 12

v1 simple DB table(source_id target_id

status)

Monday April 23 12

who do I followwho follows me

do I follow Xdoes X follow me

Monday April 23 12

DB was busy so we started storing parallel

version in Redis

Monday April 23 12

follow_all(300 item list)

Monday April 23 12

inconsistency

Monday April 23 12

extra logic

Monday April 23 12

so much extra logic

Monday April 23 12

exposing your support team to the idea of cache invalidation

Monday April 23 12

Monday April 23 12

redesign took a page from twitterrsquos book

Monday April 23 12

PG can handle tens of thousands of requests very light memcached

caching

Monday April 23 12

next steps

Monday April 23 12

isolating services to minimize open conns

Monday April 23 12

investigate physical hardware etc to reduce

need to re-shard

Monday April 23 12

Wrap up

Monday April 23 12

you donrsquot need to give up PGrsquos durability amp features to shard

Monday April 23 12

continue to let the DB do what the DB is great at

Monday April 23 12

ldquodonrsquot shard until you have tordquo

Monday April 23 12

(but donrsquot over-estimate how hard it will be either)

Monday April 23 12

scaled within constraints of the cloud

Monday April 23 12

PG success story

Monday April 23 12

(wersquore really excited about 92)

Monday April 23 12

thanks any qs

Monday April 23 12

who do I followwho follows me

do I follow Xdoes X follow me

Monday April 23 12

DB was busy so we started storing parallel

version in Redis

Monday April 23 12

follow_all(300 item list)

Monday April 23 12

inconsistency

Monday April 23 12

extra logic

Monday April 23 12

so much extra logic

Monday April 23 12

exposing your support team to the idea of cache invalidation

Monday April 23 12

Monday April 23 12

redesign took a page from twitterrsquos book

Monday April 23 12

PG can handle tens of thousands of requests very light memcached

caching

Monday April 23 12

next steps

Monday April 23 12

isolating services to minimize open conns

Monday April 23 12

investigate physical hardware etc to reduce

need to re-shard

Monday April 23 12

Wrap up

Monday April 23 12

you donrsquot need to give up PGrsquos durability amp features to shard

Monday April 23 12

continue to let the DB do what the DB is great at

Monday April 23 12

ldquodonrsquot shard until you have tordquo

Monday April 23 12

(but donrsquot over-estimate how hard it will be either)

Monday April 23 12

scaled within constraints of the cloud

Monday April 23 12

PG success story

Monday April 23 12

(wersquore really excited about 92)

Monday April 23 12

thanks any qs

Monday April 23 12

DB was busy so we started storing parallel

version in Redis

Monday April 23 12

follow_all(300 item list)

Monday April 23 12

inconsistency

Monday April 23 12

extra logic

Monday April 23 12

so much extra logic

Monday April 23 12

exposing your support team to the idea of cache invalidation

Monday April 23 12

Monday April 23 12

redesign took a page from twitterrsquos book

Monday April 23 12

PG can handle tens of thousands of requests very light memcached

caching

Monday April 23 12

next steps

Monday April 23 12

isolating services to minimize open conns

Monday April 23 12

investigate physical hardware etc to reduce

need to re-shard

Monday April 23 12

Wrap up

Monday April 23 12

you donrsquot need to give up PGrsquos durability amp features to shard

Monday April 23 12

continue to let the DB do what the DB is great at

Monday April 23 12

ldquodonrsquot shard until you have tordquo

Monday April 23 12

(but donrsquot over-estimate how hard it will be either)

Monday April 23 12

scaled within constraints of the cloud

Monday April 23 12

PG success story

Monday April 23 12

(wersquore really excited about 92)

Monday April 23 12

thanks any qs

Monday April 23 12

follow_all(300 item list)

Monday April 23 12

inconsistency

Monday April 23 12

extra logic

Monday April 23 12

so much extra logic

Monday April 23 12

exposing your support team to the idea of cache invalidation

Monday April 23 12

Monday April 23 12

redesign took a page from twitterrsquos book

Monday April 23 12

PG can handle tens of thousands of requests very light memcached

caching

Monday April 23 12

next steps

Monday April 23 12

isolating services to minimize open conns

Monday April 23 12

investigate physical hardware etc to reduce

need to re-shard

Monday April 23 12

Wrap up

Monday April 23 12

you donrsquot need to give up PGrsquos durability amp features to shard

Monday April 23 12

continue to let the DB do what the DB is great at

Monday April 23 12

ldquodonrsquot shard until you have tordquo

Monday April 23 12

(but donrsquot over-estimate how hard it will be either)

Monday April 23 12

scaled within constraints of the cloud

Monday April 23 12

PG success story

Monday April 23 12

(wersquore really excited about 92)

Monday April 23 12

thanks any qs

Monday April 23 12

inconsistency

Monday April 23 12

extra logic

Monday April 23 12

so much extra logic

Monday April 23 12

exposing your support team to the idea of cache invalidation

Monday April 23 12

Monday April 23 12

redesign took a page from twitterrsquos book

Monday April 23 12

PG can handle tens of thousands of requests very light memcached

caching

Monday April 23 12

next steps

Monday April 23 12

isolating services to minimize open conns

Monday April 23 12

investigate physical hardware etc to reduce

need to re-shard

Monday April 23 12

Wrap up

Monday April 23 12

you donrsquot need to give up PGrsquos durability amp features to shard

Monday April 23 12

continue to let the DB do what the DB is great at

Monday April 23 12

ldquodonrsquot shard until you have tordquo

Monday April 23 12

(but donrsquot over-estimate how hard it will be either)

Monday April 23 12

scaled within constraints of the cloud

Monday April 23 12

PG success story

Monday April 23 12

(wersquore really excited about 92)

Monday April 23 12

thanks any qs

Monday April 23 12

extra logic

Monday April 23 12

so much extra logic

Monday April 23 12

exposing your support team to the idea of cache invalidation

Monday April 23 12

Monday April 23 12

redesign took a page from twitterrsquos book

Monday April 23 12

PG can handle tens of thousands of requests very light memcached

caching

Monday April 23 12

next steps

Monday April 23 12

isolating services to minimize open conns

Monday April 23 12

investigate physical hardware etc to reduce

need to re-shard

Monday April 23 12

Wrap up

Monday April 23 12

you donrsquot need to give up PGrsquos durability amp features to shard

Monday April 23 12

continue to let the DB do what the DB is great at

Monday April 23 12

ldquodonrsquot shard until you have tordquo

Monday April 23 12

(but donrsquot over-estimate how hard it will be either)

Monday April 23 12

scaled within constraints of the cloud

Monday April 23 12

PG success story

Monday April 23 12

(wersquore really excited about 92)

Monday April 23 12

thanks any qs

Monday April 23 12

so much extra logic

Monday April 23 12

exposing your support team to the idea of cache invalidation

Monday April 23 12

Monday April 23 12

redesign took a page from twitterrsquos book

Monday April 23 12

PG can handle tens of thousands of requests very light memcached

caching

Monday April 23 12

next steps

Monday April 23 12

isolating services to minimize open conns

Monday April 23 12

investigate physical hardware etc to reduce

need to re-shard

Monday April 23 12

Wrap up

Monday April 23 12

you donrsquot need to give up PGrsquos durability amp features to shard

Monday April 23 12

continue to let the DB do what the DB is great at

Monday April 23 12

ldquodonrsquot shard until you have tordquo

Monday April 23 12

(but donrsquot over-estimate how hard it will be either)

Monday April 23 12

scaled within constraints of the cloud

Monday April 23 12

PG success story

Monday April 23 12

(wersquore really excited about 92)

Monday April 23 12

thanks any qs

Monday April 23 12

exposing your support team to the idea of cache invalidation

Monday April 23 12

Monday April 23 12

redesign took a page from twitterrsquos book

Monday April 23 12

PG can handle tens of thousands of requests very light memcached

caching

Monday April 23 12

next steps

Monday April 23 12

isolating services to minimize open conns

Monday April 23 12

investigate physical hardware etc to reduce

need to re-shard

Monday April 23 12

Wrap up

Monday April 23 12

you donrsquot need to give up PGrsquos durability amp features to shard

Monday April 23 12

continue to let the DB do what the DB is great at

Monday April 23 12

ldquodonrsquot shard until you have tordquo

Monday April 23 12

(but donrsquot over-estimate how hard it will be either)

Monday April 23 12

scaled within constraints of the cloud

Monday April 23 12

PG success story

Monday April 23 12

(wersquore really excited about 92)

Monday April 23 12

thanks any qs

Monday April 23 12

Monday April 23 12

redesign took a page from twitterrsquos book

Monday April 23 12

PG can handle tens of thousands of requests very light memcached

caching

Monday April 23 12

next steps

Monday April 23 12

isolating services to minimize open conns

Monday April 23 12

investigate physical hardware etc to reduce

need to re-shard

Monday April 23 12

Wrap up

Monday April 23 12

you donrsquot need to give up PGrsquos durability amp features to shard

Monday April 23 12

continue to let the DB do what the DB is great at

Monday April 23 12

ldquodonrsquot shard until you have tordquo

Monday April 23 12

(but donrsquot over-estimate how hard it will be either)

Monday April 23 12

scaled within constraints of the cloud

Monday April 23 12

PG success story

Monday April 23 12

(wersquore really excited about 92)

Monday April 23 12

thanks any qs

Monday April 23 12

redesign took a page from twitterrsquos book

Monday April 23 12

PG can handle tens of thousands of requests very light memcached

caching

Monday April 23 12

next steps

Monday April 23 12

isolating services to minimize open conns

Monday April 23 12

investigate physical hardware etc to reduce

need to re-shard

Monday April 23 12

Wrap up

Monday April 23 12

you donrsquot need to give up PGrsquos durability amp features to shard

Monday April 23 12

continue to let the DB do what the DB is great at

Monday April 23 12

ldquodonrsquot shard until you have tordquo

Monday April 23 12

(but donrsquot over-estimate how hard it will be either)

Monday April 23 12

scaled within constraints of the cloud

Monday April 23 12

PG success story

Monday April 23 12

(wersquore really excited about 92)

Monday April 23 12

thanks any qs

Monday April 23 12

PG can handle tens of thousands of requests very light memcached

caching

Monday April 23 12

next steps

Monday April 23 12

isolating services to minimize open conns

Monday April 23 12

investigate physical hardware etc to reduce

need to re-shard

Monday April 23 12

Wrap up

Monday April 23 12

you donrsquot need to give up PGrsquos durability amp features to shard

Monday April 23 12

continue to let the DB do what the DB is great at

Monday April 23 12

ldquodonrsquot shard until you have tordquo

Monday April 23 12

(but donrsquot over-estimate how hard it will be either)

Monday April 23 12

scaled within constraints of the cloud

Monday April 23 12

PG success story

Monday April 23 12

(wersquore really excited about 92)

Monday April 23 12

thanks any qs

Monday April 23 12

next steps

Monday April 23 12

isolating services to minimize open conns

Monday April 23 12

investigate physical hardware etc to reduce

need to re-shard

Monday April 23 12

Wrap up

Monday April 23 12

you donrsquot need to give up PGrsquos durability amp features to shard

Monday April 23 12

continue to let the DB do what the DB is great at

Monday April 23 12

ldquodonrsquot shard until you have tordquo

Monday April 23 12

(but donrsquot over-estimate how hard it will be either)

Monday April 23 12

scaled within constraints of the cloud

Monday April 23 12

PG success story

Monday April 23 12

(wersquore really excited about 92)

Monday April 23 12

thanks any qs

Monday April 23 12

isolating services to minimize open conns

Monday April 23 12

investigate physical hardware etc to reduce

need to re-shard

Monday April 23 12

Wrap up

Monday April 23 12

you donrsquot need to give up PGrsquos durability amp features to shard

Monday April 23 12

continue to let the DB do what the DB is great at

Monday April 23 12

ldquodonrsquot shard until you have tordquo

Monday April 23 12

(but donrsquot over-estimate how hard it will be either)

Monday April 23 12

scaled within constraints of the cloud

Monday April 23 12

PG success story

Monday April 23 12

(wersquore really excited about 92)

Monday April 23 12

thanks any qs

Monday April 23 12

investigate physical hardware etc to reduce

need to re-shard

Monday April 23 12

Wrap up

Monday April 23 12

you donrsquot need to give up PGrsquos durability amp features to shard

Monday April 23 12

continue to let the DB do what the DB is great at

Monday April 23 12

ldquodonrsquot shard until you have tordquo

Monday April 23 12

(but donrsquot over-estimate how hard it will be either)

Monday April 23 12

scaled within constraints of the cloud

Monday April 23 12

PG success story

Monday April 23 12

(wersquore really excited about 92)

Monday April 23 12

thanks any qs

Monday April 23 12

Wrap up

Monday April 23 12

you donrsquot need to give up PGrsquos durability amp features to shard

Monday April 23 12

continue to let the DB do what the DB is great at

Monday April 23 12

ldquodonrsquot shard until you have tordquo

Monday April 23 12

(but donrsquot over-estimate how hard it will be either)

Monday April 23 12

scaled within constraints of the cloud

Monday April 23 12

PG success story

Monday April 23 12

(wersquore really excited about 92)

Monday April 23 12

thanks any qs

Monday April 23 12

you donrsquot need to give up PGrsquos durability amp features to shard

Monday April 23 12

continue to let the DB do what the DB is great at

Monday April 23 12

ldquodonrsquot shard until you have tordquo

Monday April 23 12

(but donrsquot over-estimate how hard it will be either)

Monday April 23 12

scaled within constraints of the cloud

Monday April 23 12

PG success story

Monday April 23 12

(wersquore really excited about 92)

Monday April 23 12

thanks any qs

Monday April 23 12

continue to let the DB do what the DB is great at

Monday April 23 12

ldquodonrsquot shard until you have tordquo

Monday April 23 12

(but donrsquot over-estimate how hard it will be either)

Monday April 23 12

scaled within constraints of the cloud

Monday April 23 12

PG success story

Monday April 23 12

(wersquore really excited about 92)

Monday April 23 12

thanks any qs

Monday April 23 12

ldquodonrsquot shard until you have tordquo

Monday April 23 12

(but donrsquot over-estimate how hard it will be either)

Monday April 23 12

scaled within constraints of the cloud

Monday April 23 12

PG success story

Monday April 23 12

(wersquore really excited about 92)

Monday April 23 12

thanks any qs

Monday April 23 12

(but donrsquot over-estimate how hard it will be either)

Monday April 23 12

scaled within constraints of the cloud

Monday April 23 12

PG success story

Monday April 23 12

(wersquore really excited about 92)

Monday April 23 12

thanks any qs

Monday April 23 12

scaled within constraints of the cloud

Monday April 23 12

PG success story

Monday April 23 12

(wersquore really excited about 92)

Monday April 23 12

thanks any qs

Monday April 23 12

PG success story

Monday April 23 12

(wersquore really excited about 92)

Monday April 23 12

thanks any qs

Monday April 23 12

(wersquore really excited about 92)

Monday April 23 12

thanks any qs

Monday April 23 12

thanks any qs

Monday April 23 12

top related