Lessons from Highly Scalable Architectures at Social Networking Sites
DESCRIPTION
What are the techniques and technologies used by popular social networking sites such as Facebook, Twitter, Tumblr, Pinterest or Instagram? How do they architect their systems to scale to hundreds of millions of visits per day?
TRANSCRIPT
1
Software Engineering in a Cloud World
Lessons from highly-scalable architectures
at social networking sites
Patrick [email protected]
2
Social Networking – Trends 2012
more users … higher share of time … for longer
Source: State of Media: The Social Media Report 2012, Nielsen, http://is.gd/LYHmnm
3
User Adoption Faster for New Entrants
Source: author's compilation of data from company reports, press statements, technical blogs & presentations
[Chart: User Growth – millions of users (logarithmic scale) vs. years since launch; newer entrants such as Tumblr reach large user bases faster.]
4
Staggering Volumes
Page views: 500 million/day
Reads: ~40k requests/second
Writes: ~1 million/second
New data: ~3 TB/day
Servers: 1'000
Engineers: 20
Sources: http://is.gd/mpdOPN, http://is.gd/1vJ1il, http://is.gd/58X8ns, http://is.gd/LGexI6, http://is.gd/tZfNPA, http://is.gd/bcpCJc, http://is.gd/kXVEEF
Likes (counter): 2.7 billion/day
Photos: 300 million/day
Queries: 70'000/day
New data: 500 TB/day
Servers: "tens of thousands"
Engineers: ~1'700
Tweets (peak): ~25'000/second
Tweets (avg): ~250 million/day (1'000/second)
API calls: 6 billion/day (70'000/second)
New data: ~8 TB/day (80 MB/second)
Engineers: 500 (of 1'000 total employees)
Page views: 2.3 billion/month
Growth rate: 50% (visitors, March 2012)
Machinery: 150 web servers, 90 caching servers, 70 database instances, 35 logging/internal
Data size: 410 TB (user data)
Employees: ~65 (NB: until end of 2011, 12)
5
Methodology
● Author's synthesis
● Information collected 2010 – 2012
● Mostly secondary research conducted on the internet
● Sources of information
  ● Public presentations at industry conferences
  ● Engineering blogs by social network companies
  ● Research reports
  ● Technology documentation
  ● Author's data analysis
● Threats to validity
  ● Subjective selection of information sources
  ● Non-systematic analysis and synthesis of the data gathered
6
Typical Scalability Approaches
● Load Balancing
● Static content on dedicated servers
● Caching
● Database Partitioning
● Replication (high availability)
● (How) Do these work at social-network scale?
7
Source: Aaditya Agarwal, Facebook Architecture, Qcon'2008, London
Functionality
- Type of blog
- User profile with personal data
- Users 'friend' each other
- Post public or private messages

Data Center
- owned by Facebook
Software Architecture
8
Software architecture
- Ruby on Rails, Erlang
- since 2009: JVM, Scala
- MySQL
- Memcached
- Unicorn (Mongrel) web server

Functionality
- 140-character messages
- Users follow each other
- Posts can contain pictures, media links etc.

Data Center
- dedicated data center (outsourced)
Source: Krikorian R., Twitter's Real Time Architecture, Qcon NYC 2012
9
tumblr
Software architecture
- PHP, Ruby, Scala
- Redis, HBase, MySQL
- Memcache
- Thrift

Functionality
- Microblogging
- Users follow each other
- Dashboard similar to a Facebook page

Data Center
- started at Rackspace
- co-located, dedicated
Source: Tumblr Architecture – 15 Billion Page Views a Month and Harder to Scale than Twitter, Highscalability Blog
Source: tumblr.com
10
Data Center
- Amazon EC2, EBS, S3

Functionality
- Photo sharing pinboards
- Categorize images, share with others
- mostly used by women (2012: 83%)

Software architecture
- Python
- Django
Source: pinterest.com
Source: Jackson B., Pinterest growth driven by Amazon cloud scalability, 04.2012, techworld.com
11
Software architecture
- Python, Django
- PostgreSQL
- Redis
- Nginx
- Node.js
- Android

Functionality
- Smartphone photo sharing
- Post to other social networks
- Send messages

Data Center
- started with a single small-scale PC (up to 30+ million users)
- 100+ instances at Amazon (EC2, EBS; S3 for photos)

Employees
- 2010: 2 engineers; 2012: 5 engineers
- That's the total employee count
Source: Instagram, What Powers Instagram: Hundreds of Instances, Dozens of Technologies, Instagram Engineering Blog
Source: Wikipedia
12
Scalability Options
scale up: more CPUs, RAM, disk per machine
● transparent scalability
● scales 'out of the box'
● complex hardware (high cost)
● specialised knowledge
● more complex software (multi-core)

scale out: more machines
● simple hardware (low cost)
● scale by numbers
● difficult to implement
● difficult to maintain (myth?)
● more complex software (expensive licenses)

either way
- scale by parallelization
- partition for fault tolerance
- replicate for reliability

this means
- decouple components
- asynchronous processing
- monitor to operate
13
Caching
● Goal Reduce response times for web site & data access
● Product memcached (open source, initially developed 2003)
● Benefits All accesses (read & write) are O(1)
14
memcached
Web Server
Load Balancer
Web Server
memcached
memcached
memcached
memcached
client
server = hashf(key) % #servers
Features
● Remote-accessible in-memory key/value cache
● Least Recently Used (LRU) eviction
● Shared-nothing, distributed architecture

Implementation
● memcached nodes map to key ranges (client-side hashing – no SPOF)
● Multi-threaded, event-based async network I/O (200'000 requests/s at Facebook)
● Single-node fault tolerance by consistent hashing scheme
Keys={1,2,3}
Keys={4,5,6}
Keys={7,8,9}
Keys={10,11,12}
Source: memcached.org
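The client-side mapping on this slide (server = hashf(key) % #servers) can be sketched in a few lines of Python. This is an illustrative sketch, not memcached's actual client code: the server names and the choice of MD5 are assumptions.

```python
import hashlib

def pick_server(key, servers):
    """Naive client-side mapping: hash the key, take it modulo the
    number of cache servers (server = hashf(key) % #servers)."""
    h = int(hashlib.md5(key.encode()).hexdigest(), 16)
    return servers[h % len(servers)]

# hypothetical cache nodes
servers = ["cache1:11211", "cache2:11211", "cache3:11211", "cache4:11211"]
node = pick_server("user:42:profile", servers)
```

The modulo scheme spreads keys evenly, but adding or removing a server changes the divisor and remaps almost every key – exactly the problem consistent hashing addresses on the next slide.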
16
Consistent Hashing in a nutshell
server = min(s | s.location >= (hashf(key) % #locations))
Consistent hashing: buckets are located on a ring and hold keys up to a pre-defined limit => at worst, only the keys of the failing node need to be re-mapped
Source: David Karger et al., Web caching with consistent hashing, Computer Networks, Vol. 31, 1999
Keys={1,2,3}
Keys={3,4,5}
Keys={5,6,7}
Keys={8,9,10}
Keys={1,2,3}
Keys={1,2,3,4,5}
Keys={5,6,7}
Keys={8,9,10}
'Traditional' hashing: buckets contain a pre-defined range => at worst this requires re-building the full cache; every node may be affected
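A minimal ring can make the contrast concrete. This sketch assumes MD5 as the hash function and one point per node (real implementations add virtual nodes for better balance):

```python
import bisect
import hashlib

def _h(value):
    return int(hashlib.md5(value.encode()).hexdigest(), 16)

class ConsistentHashRing:
    """Each node owns the arc of the ring up to its hash point, so
    removing a node only remaps the keys that node held."""
    def __init__(self, nodes):
        self._ring = sorted((_h(n), n) for n in nodes)

    def node_for(self, key):
        points = [p for p, _ in self._ring]
        # first node clockwise from the key's position (wrapping around)
        i = bisect.bisect_right(points, _h(key)) % len(self._ring)
        return self._ring[i][1]

    def remove(self, node):
        self._ring = [(p, n) for p, n in self._ring if n != node]

ring = ConsistentHashRing(["cache1", "cache2", "cache3", "cache4"])
keys = [f"user:{i}" for i in range(1000)]
before = {k: ring.node_for(k) for k in keys}
ring.remove("cache3")
after = {k: ring.node_for(k) for k in keys}
# only keys that lived on cache3 get remapped
moved = [k for k in keys if before[k] != after[k]]
```

Compare with the modulo scheme: there, removing one of four servers would remap roughly three quarters of all keys; here, only cache3's keys move to the next node on the ring.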
17
Memcached Results
● Results at Twitter
● 100s of servers
● 20TB of data covering >30 services
● 2 trillion queries/day (>23 million queries/second)
● Modified memcached, released as “Twemcache”
● Key objectives
● High Availability
● Predictable Performance
● Dynamic adaptation of size (grow/shrink)
● Monitoring of cache effectiveness
Source: Chris Aniszczyk, Caching with Twemcache, 07.2012, Twitter Engineering Blog
18
Shard your data
● Shards
● horizontal partitions (e.g. by user, time, ...)
● distributed to multiple physical nodes => parallelized data access
● data typically denormalized
● similar data is replicated to all shards – e.g. static data
node1 node2 node3 node4
Web Server
db-client
node = hashf(userid) % #nodes
Userids={A, …, F}
Userids={G, …, L}
Userids={….}
Userids={….}
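The routing rule above (node = hashf(userid) % #nodes) can be sketched with in-memory dicts standing in for the database nodes; all names here and the choice of SHA-1 are illustrative, not any site's actual scheme.

```python
import hashlib

# in-memory dicts stand in for four database shard nodes
SHARDS = {i: {} for i in range(4)}

def shard_for(user_id):
    # node = hashf(userid) % #nodes, as on the slide
    h = int(hashlib.sha1(str(user_id).encode()).hexdigest(), 16)
    return SHARDS[h % len(SHARDS)]

def save_post(user_id, post):
    # all of a user's rows live together on one shard (denormalized),
    # so a single-node failure only affects that subset of users
    shard_for(user_id).setdefault(user_id, []).append(post)

def posts_of(user_id):
    return shard_for(user_id).get(user_id, [])

save_post(42, "hello")
save_post(42, "world")
```

Note the drawback of this fixed modulo mapping: changing the number of nodes remaps most users, which is why growing or shrinking such a cluster is hard – the 'fixed hashing' caveat on the next slide.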
19
Sharding Results
● Impressive results at Facebook
● 1800 MySQL servers
● 4ms reads, 5ms writes
● 60M queries/second (peak)
● Growth 20x (overall data, over two years)
● What works
● Shard by user – group similar data into the same shard
● Linking across shards – store cross-references in both shards (two-way access)
● Fault tolerance: single-instance failure only affects a subset of users
● Consistent hashing
● What doesn't
● Joins across shards – not efficiently possible
● Sharding by time not helpful – one shard keeps running "hot"
● Sharding by function not helpful – non-uniform distribution, hot spots, unique access patterns
● Fixed hashing – nodes become unbalanced, difficult to grow or shrink
Source: Facebook Techtalks, MySQL & Hbase, December 5, 2011
20
Managing shards
● Results at Tumblr
  ● 200 DB servers
  ● Grouped into 5 global pools / 58 shard pools
  ● 28 TB
  ● 100 billion rows
  ● No DBAs – 2 engineers keep this running at 50% of their time
● Jetpants – DB management toolkit
  ● Clone slaves efficiently
  ● Split shards into new shards
  ● Master promotions
  ● Command line to work with topology
● Open sourced
  ● https://github.com/tumblr/jetpants
Source: Elias E., Managing Large Sharded Topologies with Jetpants, 12.2012, Percona Live MySQL Conference
21
Asynchronous & Distributed Work
● Problem Do more work in less time
● Solution Distributed, asynchronous processing (MapReduce)
● Requirements
● Split a job into multiple pieces
● Distribute work
● Collect results
● Fault tolerance
● Technologies
● Message Queueing
● Gearman
● Hadoop / Pig
22
Asynchronous Work Example
● Instagram Push Notifications
● Image uploads
● All uploads go into a task-queue
● ~200 worker processes asynchronously process the images
● Gearman
● Open Source
● Framework to distribute work
● Load Balancing
● No SPOF
Source: gearman.org
Source: Instagram, What Powers Instagram: Hundreds of Instances, Dozens of Technologies, 2012, Instagram Engineering Blog
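Instagram's queue is not public code, but the pattern described above – enqueue on upload, process asynchronously in a worker pool – can be sketched with Python's standard library (the names and the "thumbnail" step are hypothetical placeholders):

```python
import queue
import threading

# uploads go into a task queue; a worker pool processes them
# asynchronously, so the web request can return before the slow
# image processing happens
tasks = queue.Queue()
results = []
results_lock = threading.Lock()

def worker():
    while True:
        image = tasks.get()
        if image is None:                    # poison pill: shut down
            tasks.task_done()
            return
        processed = f"thumbnail({image})"    # placeholder for real work
        with results_lock:
            results.append(processed)
        tasks.task_done()

workers = [threading.Thread(target=worker) for _ in range(4)]
for w in workers:
    w.start()
for i in range(10):
    tasks.put(f"upload-{i}.jpg")             # enqueue the "uploads"
for _ in workers:
    tasks.put(None)
tasks.join()                                 # wait until all work is done
```

Gearman generalizes the same idea across machines: the queue becomes a network service, and workers register on remote hosts instead of in-process threads.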
23
Apache Hadoop
● What it is
● Distributed MapReduce engine
● Fault tolerant
● Asynchronous job scheduling
● Scalable: e.g. a 4000-node cluster sorts 1 TB in 62 seconds
● Data storage
● HDFS – scalable to multiple PB
● Distributed storage
● Written in Java
● Data replicated among 3 nodes
● Block storage of 64MB/block
● No SPOF
● Apache Pig
● High-level query language
Sources: Apache Hadoop, Wikipedia, The Free Encyclopedia, accessed January 8, 2013
Weil K., NoSQL at Twitter, 04.2010, NoSQL EU 2012
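As a toy illustration of the MapReduce pattern that Hadoop implements (and that Pig queries compile down to), here is a record-counting job shrunk to a handful of in-memory tuples; the data is made up:

```python
from collections import defaultdict
from itertools import chain

# count tweets per user - the same shape as a row-counting Hadoop job
tweets = [("alice", "hi"), ("bob", "yo"), ("alice", "again")]

def map_phase(record):
    user, _text = record
    yield (user, 1)                 # emit one (key, 1) pair per tweet

def reduce_phase(pairs):
    counts = defaultdict(int)       # Hadoop shuffles/groups by key;
    for key, value in pairs:        # here a single dict plays that role
        counts[key] += value
    return dict(counts)

result = reduce_phase(chain.from_iterable(map_phase(t) for t in tweets))
# result == {"alice": 2, "bob": 1}
```

The scalability comes from running many map tasks on different data blocks in parallel and merging their partial counts in the reduce phase – the framework handles distribution, scheduling, and retries.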
24
Results
● NoSQL at Twitter
● Store 7TB of data/day
● HD speed: ~80MB/s => 24.3 hours
● Need to parallelize writes and reads
● Analysis using Pig
● Count all tweets
● 12 billion
● 5 minutes
Source: Weil K., NoSQL at Twitter, 04.2010, NoSQL EU 2012
25
Simplified Queries
Source: Weil K., NoSQL at Twitter, 04.2010, NoSQL EU 2012
27
Service Oriented Architecture
“Onion-Style”
outer services
- public (e.g. REST)
- user interface
- typically scripted (Python, Ruby, JavaScript)

inner services
- private & highly efficient
- data access, calculation etc.
- workers to accomplish work in parallel
- mix of languages (Java, Scala, Python, C, ...)

fire hose
- highly available, scalable service bus
- distributes services as needed
- typically asynchronous
28
Tumblr Firehose
Apache Kafka
- O(1) persistent message queue
- several 100k messages/second
- pub/sub interface

Apache ZooKeeper (cluster)
- distributed coordination
- highly available
Finagle
- asynchronous RPC system
- JVM-hosted languages (Java, Scala, ...)
- connection pools, failure detectors, failover, load balancing, back-pressure, ...
NewPost finagle
HTTP Clients

Results
- 4 × CPUs @ 72 GB RAM, 2 disks
- provide 1 week of streams
- ~400k messages/second
- 1 week of Tumblr posts
public API(JSON)
internal API(thrift)
Source: Blake M., Tumblr Firehose - The Gory Details, 2012, Tumblr Engineering Blog
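Kafka itself is out of scope here, but the pub/sub shape of the firehose – every subscriber receives every published message via its own stream – can be sketched in miniature (class and message names are invented for illustration):

```python
import queue

class Firehose:
    """Toy in-process pub/sub bus in the spirit of Kafka's interface:
    each subscriber gets its own queue and sees every message."""
    def __init__(self):
        self._subscribers = []

    def subscribe(self):
        q = queue.Queue()
        self._subscribers.append(q)
        return q

    def publish(self, message):
        for q in self._subscribers:    # fan out to all subscribers
            q.put(message)

bus = Firehose()
a, b = bus.subscribe(), bus.subscribe()
bus.publish({"type": "NewPost", "blog": "example", "id": 1})
```

What Kafka adds on top of this shape is persistence (a week of posts replayable from disk), partitioning across brokers, and consumer offsets, with ZooKeeper coordinating the cluster.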
29
SOA revisited – network efficiency
consumer → interface → provider

consumer:
1. Serialize
2. Wait for response
3. Deserialize

provider:
1. Deserialize
2. Provide response
3. Serialize
CORBA, HTTP/JSON, WSDL/XML/SOAP, ...
efficient?
30
Apache thrift – optimized wire protocol
● What it is
● Human-readable interface definition language (non-XML)
● Cross-language service implementation
● Code-generation engine (C++, Java, Python, JavaScript, …)
● Binary wire protocol
● Benefits
● Low-overhead serialization/de-serialization
● Native language bindings (no XML parsing or XSD)
● Efficient protocol implementation
31
thrift example
struct UserProfile {
  1: i32 uid,
  2: string name,
  3: string blurb
}

service UserStorage {
  void store(1: UserProfile user),
  UserProfile retrieve(1: i32 uid)
}

# Make an object
up = UserProfile(uid=1, name="Test User", blurb="Thrift is great")

# Talk to a server via TCP sockets, binary protocol
transport = TSocket.TSocket("localhost", 9090)
transport.open()
protocol = TBinaryProtocol.TBinaryProtocol(transport)

# Use the service we already defined
service = UserStorage.Client(protocol)
service.store(up)
up2 = service.retrieve(1)

class UserStorageHandler : virtual public UserStorageIf {
 public:
  UserStorageHandler() {
    // Your initialization goes here
  }

  void store(const UserProfile& user) {
    // Your implementation goes here
    printf("store\n");
  }

  void retrieve(UserProfile& _return, const int32_t uid) {
    // Your implementation goes here
    printf("retrieve\n");
  }
};

// main ...
interface client
Service implementation
Source: thrift.apache.org
32
Serialization / Deserialization Performance
Serialization … (thrift: -66% )
… Deserialization (thrift: -92%)
Message size (thrift: -19%)
Benchmark
- CPU: Core i7, 2.7 GHz
- Serialization of a service message (media descriptor of a video)
Source: Author testing
33
redis: In-Memory DB
redis
redis
redis
redis
consumer
Keys={1,2,3}
Keys={3,4,5}
Keys={5,6,7}
Keys={8,9,10}
master
slave
slave
slave
async replication
Problem Require the speed of a cache with the query semantics, persistence, and fault tolerance of a DB cluster
Solution redis.io – a distributed in-memory DB
Redis
● fast: O(1) access times – 100'000 writes/second, 80'000 reads/second
● fault-tolerant
● datatypes: strings, hashes, lists, sets, sorted sets
● complex queries: intersection, subset, sort, ...
● more than just a DB: pub/sub channels
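The set queries listed above are what set Redis apart from a plain key/value cache. As a rough preview of their semantics – plain Python sets standing in for Redis's SADD/SINTER/SORT commands; the blog and user names are made up:

```python
# stand-in data: who follows which blog (Redis would hold these as sets)
followers = {
    "blog:tech": {"alice", "bob", "carol"},
    "blog:art": {"bob", "carol", "dave"},
}

# SINTER blog:tech blog:art -> users who follow both blogs
both = followers["blog:tech"] & followers["blog:art"]

# SORT -> Redis can return members in a defined order
ordered = sorted(both)
```

In Redis these operations run server-side on the in-memory data, so an intersection over large sets avoids shipping the full sets to the client first.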
35
redis results
● tumblr
● >7'500 notifications/second (well above MySQL's max. concurrent limit)
● <5 ms response-time requirement
● Redis: 30'000 requests/second
Source: Blake M., Staircar: Redis-powered notifications, 07.2011, Tumblr Engineering Blog
36
Automate everything & Monitor
● If just two engineers
● run 100+ servers
● maintain dozens of databases
● scale a system to 30+ million users
● … automation is like air to breathe …
● … monitoring is the lifeline

Dashboard @ Twitter
Source: Adams J., Scaling Twitter, 2010, Chirp Conference
37
Cell Architecture
● Cell Architecture
● Self-contained cells of data + logic
● Each cell itself made up of a cluster of nodes
● Cells provide internal failover
● Reliability
● Scalability
Cell
Application Server Cluster
Metadata store (HBase)
Discovery Service
Client
consistent hashing by user-id
Source: Malik P., Scaling the Messages Application Back End, 04.11, Facebook Engineering's Notes
38
Summary
Scalability
● Cache
● Data sharding
● In-memory DB
● Efficient wire protocols

Flexibility
● SOA
  ● Decoupled
  ● Layered (outer, inner services)
  ● Asynchronous (firehose)
● Automation

Reliability
● Replication
● Cell architecture
39
Take Away for Application Development
● Scalability => Distribution
  ● Loosely coupled components (accessible via APIs, services)
  ● Efficiency at every level
  ● Shared nothing
● Reliability => Replication
  ● Automation
  ● Monitoring
  ● Fast provisioning of replicas
● Flexibility => Simplification
  ● Build for simple use
  ● Abstract to simplify (e.g. Pig/Hadoop, Redis/in-memory DB)
  ● API-everything
40
Paradigm Shift?
● New normal
● 100s of machines
● <5 engineers
● Distributed work load
● Horizontal scalability
● PBs of data
● Drivers
● Low barriers to entry – free or low-cost hosting
● Declining cost – CPU, storage, networking
● Web-scale ready open-source software
41
Q & A
Thank you
42
What we haven't covered
● CAP Theorem
● A/B Testing
● NoSQL Databases