Lessons from Highly Scalable Architectures at Social Networking Sites
DESCRIPTION
What are the techniques and technologies used by popular social networking sites such as Facebook, Twitter, Tumblr, Pinterest or Instagram? How do they architect their systems to scale to hundreds of millions of visits per day?
TRANSCRIPT
1
Software Engineering in a Cloud World
Lessons from highly-scalable architectures
at social networking sites
Patrick [email protected]
2
Social Networking – Trends 2012
more users … higher share of time … for longer
Source: State of Media: The Social Media Report 2012, Nielsen, http://is.gd/LYHmnm
3
User Adoption Faster for New Entrants
Source: author's compilation of data from company reports, press statements, technical blogs & presentations
[Chart: User Growth – millions of users (logarithmic scale) vs. years since launch; newer entrants such as Tumblr reach large user bases faster.]
4
Staggering Volumes
Page views: 500 million/day
Reads: ~40k requests/second
Writes: ~1 million/second
New data: ~3 TB/day
Servers: 1'000
Engineers: 20
Sources: http://is.gd/mpdOPN, http://is.gd/1vJ1il, http://is.gd/58X8ns, http://is.gd/LGexI6, http://is.gd/tZfNPA, http://is.gd/bcpCJc, http://is.gd/kXVEEF
Likes (counter): 2.7 billion/day
Photos: 300 million/day
Queries: 70'000/day
New data: 500 TB/day
Servers: "tens of thousands"
Engineers: ~1'700
Tweets (peak): ~25'000/second
Tweets (avg): ~250 million/day (1'000/second)
API calls: 6 billion/day (70'000/second)
New data: ~8 TB/day (80 MB/second)
Engineers: 500 (of 1'000 total employees)
Page views: 2.3 billion/month
Growth rate: 50% (visitors, March 2012)
Machinery: 150 web servers, 90 caching servers, 70 database instances, 35 logging/internal
Data size: 410 TB (user data)
Employees: ~65 (NB: until end of 2011, 12)
5
Methodology
● Author's synthesis
● Information collected 2010 – 2012
● Mostly secondary research conducted on the internet
● Sources of information
  ● Public presentations at industry conferences
  ● Engineering blogs by social network companies
  ● Research reports
  ● Technology documentation
  ● Author's data analysis
● Threats to validity
  ● Subjective selection of information sources
  ● Non-systematic analysis and synthesis of the data gathered
6
Typical Scalability Approaches
● Load Balancing
● Static content on dedicated servers
● Caching
● Database Partitioning
● Replication (high availability)
● (How) Do these work at social-network scale?
7
Source: Aaditya Agarwal, Facebook Architecture, Qcon'2008, London
Functionality
- Type of blog
- User profile with personal data
- Users 'friend' each other
- Post public or private messages

Data Center
- owned by Facebook
Software Architecture
8
Software architecture
- Ruby on Rails, Erlang
- since 2009: JVM, Scala
- MySQL
- Memcached
- Unicorn (Mongrel) web server

Functionality
- 140-character messages
- Users follow each other
- Posts can contain pictures, media links etc.

Data Center
- dedicated data center (outsourced)
Source: Krikorian R., Twitter's Real Time Architecture, Qcon NYC 2012
9
tumblr
Software architecture
- PHP, Ruby, Scala
- Redis, HBase, MySQL
- Memcache
- Thrift

Functionality
- Microblogging
- Users follow each other
- Dashboard similar to a Facebook page

Data Center
- started at Rackspace
- co-located, dedicated
Source: Tumblr Architecture – 15 Billion Page Views a Month and Harder to Scale than Twitter, Highscalability Blog
Source: tumblr.com
10
Data Center
- Amazon EC2, EBS, S3

Functionality
- Photo sharing pinboards
- Categorize images, share with others
- mostly used by women (2012: 83%)

Software architecture
- Python
- Django
Source: pinterest.com
Source: Jackson B., Pinterest growth driven by Amazon cloud scalability, 04.2012, techworld.com
11
Software architecture
- Python, Django
- PostgreSQL
- Redis
- Nginx
- Node.js
- Android

Functionality
- Smartphone photo sharing
- Post to other social networks
- Send messages

Data Center
- started with a single small-scale PC (up to 30+ million users)
- 100+ instances at Amazon (EC2, EBS; S3 for photos)

Employees
- 2010: 2 engineers; 2012: 5 engineers
- That's the total employee count
Source: Instagram, What Powers Instagram: Hundreds of Instances, Dozens of Technologies, Instagram Engineering Blog
Source: Wikipedia
12
Scalability Options
scale up: more CPUs, RAM, disk per machine
● transparent scalability
● scales 'out of the box'
● complex hardware (high cost)
● specialised knowledge
● more complex software (multi-core)

scale out: more machines
● simple hardware (low cost)
● scale by numbers
● difficult to implement
● difficult to maintain (myth?)
● more complex software (expensive licenses)

either way
- scale by parallelization
- partition for fault tolerance
- replicate for reliability

this means
- decouple components
- asynchronous processing
- monitor to operate
13
Caching
● Goal Reduce response times for web site & data access
● Product memcached (open source, initially developed 2003)
● Benefits All accesses (read & write) are O(1)
14
memcached
Web Server
Load Balancer
Web Server
memcached
memcached
memcached
memcached
client
server = hashf(key) % #servers
Features
● Remote-accessible in-memory key/value cache
● Least Recently Used (LRU) eviction
● Shared-nothing, distributed architecture

Implementation
● memcached nodes map to key ranges (client-side hashing – no SPOF)
● Multi-threaded, event-based async network I/O (200'000 requests/s at Facebook)
● Single-node fault tolerance by consistent hashing scheme
Keys={1,2,3}
Keys={4,5,6}
Keys={7,8,9}
Keys={10,11,12}
Source: memcached.org
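The client-side mapping on this slide (server = hashf(key) % #servers) can be sketched in a few lines of Python. This is an illustrative sketch, not memcached's actual client code: the server names and the choice of MD5 are assumptions.

```python
import hashlib

def pick_server(key, servers):
    """Naive client-side mapping: hash the key, take it modulo the
    number of cache servers (server = hashf(key) % #servers)."""
    h = int(hashlib.md5(key.encode()).hexdigest(), 16)
    return servers[h % len(servers)]

# hypothetical cache nodes
servers = ["cache1:11211", "cache2:11211", "cache3:11211", "cache4:11211"]
node = pick_server("user:42:profile", servers)
```

The modulo scheme spreads keys evenly, but adding or removing a server changes the divisor and remaps almost every key – exactly the problem consistent hashing addresses on the next slide.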
16
Consistent Hashing in a nutshell
server = min(s | s.location >= (hashf(key) % #locations))
Consistent hashing: buckets are located on a ring and hold keys up to a pre-defined limit => at worst, only the keys of the failing node need to be re-mapped
Source: David Karger et al., Web caching with consistent hashing, Computer Networks, Vol. 31, 1999
Keys={1,2,3}
Keys={3,4,5}
Keys={5,6,7}
Keys={8,9,10}
Keys={1,2,3}
Keys={1,2,3,4,5}
Keys={5,6,7}
Keys={8,9,10}
'Traditional' hashing: buckets contain a pre-defined range => at worst this requires re-building the full cache; every node may be affected
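A minimal ring can make the contrast concrete. This sketch assumes MD5 as the hash function and one point per node (real implementations add virtual nodes for better balance):

```python
import bisect
import hashlib

def _h(value):
    return int(hashlib.md5(value.encode()).hexdigest(), 16)

class ConsistentHashRing:
    """Each node owns the arc of the ring up to its hash point, so
    removing a node only remaps the keys that node held."""
    def __init__(self, nodes):
        self._ring = sorted((_h(n), n) for n in nodes)

    def node_for(self, key):
        points = [p for p, _ in self._ring]
        # first node clockwise from the key's position (wrapping around)
        i = bisect.bisect_right(points, _h(key)) % len(self._ring)
        return self._ring[i][1]

    def remove(self, node):
        self._ring = [(p, n) for p, n in self._ring if n != node]

ring = ConsistentHashRing(["cache1", "cache2", "cache3", "cache4"])
keys = [f"user:{i}" for i in range(1000)]
before = {k: ring.node_for(k) for k in keys}
ring.remove("cache3")
after = {k: ring.node_for(k) for k in keys}
# only keys that lived on cache3 get remapped
moved = [k for k in keys if before[k] != after[k]]
```

Compare with the modulo scheme: there, removing one of four servers would remap roughly three quarters of all keys; here, only cache3's keys move to the next node on the ring.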
17
Memcached Results
● Results at Twitter
● 100s of servers
● 20TB of data covering >30 services
● 2 trillion queries/day (>23 million queries/second)
● Modified memcached, released as “Twemcache”
● Key objectives
● High Availability
● Predictable Performance
● Dynamic adaptation of size (grow/shrink)
● Monitoring of cache effectiveness
Source: Chris Aniszczyk, Caching with Twemcache, 07.2012, Twitter Engineering Blog
18
Shard your data
● Shards
● horizontal partitions (e.g. by user, time, ...)
● distributed to multiple physical nodes => parallelized data access
● data typically denormalized
● similar data is replicated to all shards – e.g. static data
node1 node2 node3 node4
Web Server
db-client
node = hashf(userid) % #nodes
Userids={A, …, F}
Userids={G, …, L}
Userids={….}
Userids={….}
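The routing rule above (node = hashf(userid) % #nodes) can be sketched with in-memory dicts standing in for the database nodes; all names here and the choice of SHA-1 are illustrative, not any site's actual scheme.

```python
import hashlib

# in-memory dicts stand in for four database shard nodes
SHARDS = {i: {} for i in range(4)}

def shard_for(user_id):
    # node = hashf(userid) % #nodes, as on the slide
    h = int(hashlib.sha1(str(user_id).encode()).hexdigest(), 16)
    return SHARDS[h % len(SHARDS)]

def save_post(user_id, post):
    # all of a user's rows live together on one shard (denormalized),
    # so a single-node failure only affects that subset of users
    shard_for(user_id).setdefault(user_id, []).append(post)

def posts_of(user_id):
    return shard_for(user_id).get(user_id, [])

save_post(42, "hello")
save_post(42, "world")
```

Note the drawback of this fixed modulo mapping: changing the number of nodes remaps most users, which is why growing or shrinking such a cluster is hard – the 'fixed hashing' caveat on the next slide.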
19
Sharding Results
● Impressive results at Facebook
● 1800 MySQL servers
● 4ms reads, 5ms writes
● 60M queries/second (peak)
● Growth 20x (overall data, over two years)
● What works
● Shard by user – group similar data into the same shard
● Linking across shards – store cross-references in both shards (two-way access)
● Fault tolerance: single-instance failure only affects a subset of users
● Consistent hashing
● What doesn't
● Joins across shards – not efficiently possible
● Sharding by time not helpful – one shard keeps running "hot"
● Sharding by function not helpful – non-uniform distribution, hot spots, unique access patterns
● Fixed hashing – nodes become unbalanced, difficult to grow or shrink
Source: Facebook Techtalks, MySQL & Hbase, December 5, 2011
20
Managing shards
● Results at Tumblr
  ● 200 DB servers
  ● Grouped into 5 global pools / 58 shard pools
  ● 28 TB
  ● 100 billion rows
  ● No DBAs – 2 engineers keep this running at 50% of their time
● Jetpants – DB management toolkit
  ● Clone slaves efficiently
  ● Split shards into new shards
  ● Master promotions
  ● Command line to work with topology
● Open sourced
  ● https://github.com/tumblr/jetpants
Source: Elias E., Managing Large Sharded Topologies with Jetpants, 12.2012, Percona Live MySQL Conference
21
Asynchronous & Distributed Work
● Problem Do more work in less time
● Solution Distributed, asynchronous processing (MapReduce)
● Requirements
● Split a job into multiple pieces
● Distribute work
● Collect results
● Fault tolerance
● Technologies
● Message Queueing
● Gearman
● Hadoop / Pig
22
Asynchronous Work Example
● Instagram Push Notifications
● Image uploads
● All uploads go into a task-queue
● ~200 worker processes asynchronously process the images
● Gearman
● Open Source
● Framework to distribute work
● Load Balancing
● No SPOF
Source: gearman.org
Source: Instagram, What Powers Instagram: Hundreds of Instances, Dozens of Technologies, 2012, Instagram Engineering Blog
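Instagram's queue is not public code, but the pattern described above – enqueue on upload, process asynchronously in a worker pool – can be sketched with Python's standard library (the names and the "thumbnail" step are hypothetical placeholders):

```python
import queue
import threading

# uploads go into a task queue; a worker pool processes them
# asynchronously, so the web request can return before the slow
# image processing happens
tasks = queue.Queue()
results = []
results_lock = threading.Lock()

def worker():
    while True:
        image = tasks.get()
        if image is None:                    # poison pill: shut down
            tasks.task_done()
            return
        processed = f"thumbnail({image})"    # placeholder for real work
        with results_lock:
            results.append(processed)
        tasks.task_done()

workers = [threading.Thread(target=worker) for _ in range(4)]
for w in workers:
    w.start()
for i in range(10):
    tasks.put(f"upload-{i}.jpg")             # enqueue the "uploads"
for _ in workers:
    tasks.put(None)
tasks.join()                                 # wait until all work is done
```

Gearman generalizes the same idea across machines: the queue becomes a network service, and workers register on remote hosts instead of in-process threads.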
23
Apache Hadoop
● What it is
● Distributed MapReduce engine
● Fault tolerant
● Asynchronous job scheduling
● Scalable: e.g. a 4000-node cluster sorts 1 TB in 62 seconds
● Data storage
● HDFS – scalable to multiple PB
● Distributed storage
● Written in Java
● Data replicated among 3 nodes
● Block storage of 64MB/block
● No SPOF
● Apache Pig
● High-level query language
Sources: Apache Hadoop, Wikipedia, The Free Encyclopedia, accessed January 8, 2013
Weil K., NoSQL at Twitter, 04.2010, NoSQL EU 2012
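As a toy illustration of the MapReduce pattern that Hadoop implements (and that Pig queries compile down to), here is a record-counting job shrunk to a handful of in-memory tuples; the data is made up:

```python
from collections import defaultdict
from itertools import chain

# count tweets per user - the same shape as a row-counting Hadoop job
tweets = [("alice", "hi"), ("bob", "yo"), ("alice", "again")]

def map_phase(record):
    user, _text = record
    yield (user, 1)                 # emit one (key, 1) pair per tweet

def reduce_phase(pairs):
    counts = defaultdict(int)       # Hadoop shuffles/groups by key;
    for key, value in pairs:        # here a single dict plays that role
        counts[key] += value
    return dict(counts)

result = reduce_phase(chain.from_iterable(map_phase(t) for t in tweets))
# result == {"alice": 2, "bob": 1}
```

The scalability comes from running many map tasks on different data blocks in parallel and merging their partial counts in the reduce phase – the framework handles distribution, scheduling, and retries.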
24
Results
● NoSQL at Twitter
● Store 7TB of data/day
● HD speed: ~80MB/s => 24.3 hours
● Need to parallelize writes and reads
● Analysis using Pig
● Count all tweets
● 12 billion
● 5 minutes
Source: Weil K., NoSQL at Twitter, 04.2010, NoSQL EU 2012
25
Simplified Queries
Source: Weil K., NoSQL at Twitter, 04.2010, NoSQL EU 2012
27
Service Oriented Architecture
“Onion-Style”
outer services
- public (e.g. REST)
- user interface
- typically scripted (Python, Ruby, JavaScript)

inner services
- private & highly efficient
- data access, calculation etc.
- workers to accomplish work in parallel
- mix of languages (Java, Scala, Python, C, ...)

fire hose
- highly available, scalable service bus
- distributes services as needed
- typically asynchronous
28
Tumblr Firehose
Apache Kafka
- O(1) persistent message queue
- several 100k messages/second
- pub/sub interface

Apache ZooKeeper (cluster)
- distributed coordination
- highly available
Finagle
- asynchronous RPC system
- JVM-hosted languages (Java, Scala, ...)
- connection pools, failure detectors, failover, load balancing, back-pressure, ...
NewPost finagle
HTTP Clients

Results
- 4 × CPUs @ 72 GB RAM, 2 disks
- provide 1 week of streams
- ~400k messages/second
- 1 week of Tumblr posts
public API(JSON)
internal API(thrift)
Source: Blake M., Tumblr Firehose - The Gory Details, 2012, Tumblr Engineering Blog
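Kafka itself is out of scope here, but the pub/sub shape of the firehose – every subscriber receives every published message via its own stream – can be sketched in miniature (class and message names are invented for illustration):

```python
import queue

class Firehose:
    """Toy in-process pub/sub bus in the spirit of Kafka's interface:
    each subscriber gets its own queue and sees every message."""
    def __init__(self):
        self._subscribers = []

    def subscribe(self):
        q = queue.Queue()
        self._subscribers.append(q)
        return q

    def publish(self, message):
        for q in self._subscribers:    # fan out to all subscribers
            q.put(message)

bus = Firehose()
a, b = bus.subscribe(), bus.subscribe()
bus.publish({"type": "NewPost", "blog": "example", "id": 1})
```

What Kafka adds on top of this shape is persistence (a week of posts replayable from disk), partitioning across brokers, and consumer offsets, with ZooKeeper coordinating the cluster.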
29
SOA revisited – network efficiency
consumer → interface → provider

consumer:
1. Serialize
2. Wait for response
3. Deserialize

provider:
1. Deserialize
2. Provide response
3. Serialize
CORBA, HTTP/JSON, WSDL/XML/SOAP, ...
efficient?
30
Apache thrift – optimized wire protocol
● What it is
● Human-readable interface definition language (non-XML)
● Cross-language service implementation
● Code-generation engine (C++, Java, Python, JavaScript, …)
● Binary wire protocol
● Benefits
● Low-overhead serialization/de-serialization
● Native language bindings (no XML parsing or XSD)
● Efficient protocol implementation
31
thrift example
struct UserProfile {
  1: i32 uid,
  2: string name,
  3: string blurb
}

service UserStorage {
  void store(1: UserProfile user),
  UserProfile retrieve(1: i32 uid)
}

# Make an object
up = UserProfile(uid=1, name="Test User", blurb="Thrift is great")

# Talk to a server via TCP sockets, binary protocol
transport = TSocket.TSocket("localhost", 9090)
transport.open()
protocol = TBinaryProtocol.TBinaryProtocol(transport)

# Use the service we already defined
service = UserStorage.Client(protocol)
service.store(up)
up2 = service.retrieve(1)

class UserStorageHandler : virtual public UserStorageIf {
 public:
  UserStorageHandler() {
    // Your initialization goes here
  }

  void store(const UserProfile& user) {
    // Your implementation goes here
    printf("store\n");
  }

  void retrieve(UserProfile& _return, const int32_t uid) {
    // Your implementation goes here
    printf("retrieve\n");
  }
};

// main ...
interface client
Service implementation
Source: thrift.apache.org
32
Serialization / Deserialization Performance
Serialization … (thrift: -66% )
… Deserialization (thrift: -92%)
Message size (thrift: -19%)
Benchmark
- CPU: Core i7, 2.7 GHz
- Serialization of a service message (media descriptor of a video)
Source: Author testing
33
redis: In-Memory DB
redis
redis
redis
redis
consumer
Keys={1,2,3}
Keys={3,4,5}
Keys={5,6,7}
Keys={8,9,10}
master
slave
slave
slave
async replication
Problem Require the speed of a cache with the query semantics, persistence, and fault tolerance of a DB cluster
Solution redis.io – a distributed in-memory DB
Redis
● fast: O(1) access times – 100'000 writes/second, 80'000 reads/second
● fault-tolerant
● datatypes: strings, hashes, lists, sets, sorted sets
● complex queries: intersection, subset, sort, ...
● more than just a DB: pub/sub channels
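The set queries listed above are what set Redis apart from a plain key/value cache. As a rough preview of their semantics – plain Python sets standing in for Redis's SADD/SINTER/SORT commands; the blog and user names are made up:

```python
# stand-in data: who follows which blog (Redis would hold these as sets)
followers = {
    "blog:tech": {"alice", "bob", "carol"},
    "blog:art": {"bob", "carol", "dave"},
}

# SINTER blog:tech blog:art -> users who follow both blogs
both = followers["blog:tech"] & followers["blog:art"]

# SORT -> Redis can return members in a defined order
ordered = sorted(both)
```

In Redis these operations run server-side on the in-memory data, so an intersection over large sets avoids shipping the full sets to the client first.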
35
redis results
● tumblr
● >7'500 notifications/second (well above MySQL's max. concurrent limit)
● <5 ms response-time requirement
● Redis: 30'000 requests/second
Source: Blake M., Staircar: Redis-powered notifications, 07.2011, Tumblr Engineering Blog
36
Automate everything & Monitor
● If just two engineers
● run 100+ servers
● maintain dozens of databases
● scale a system to 30+ million users
● … automation is like air to breathe …
● … monitoring is the lifeline

Dashboard @ Twitter
Source: Adams J., Scaling Twitter, 2010, Chirp Conference
37
Cell Architecture
● Cell Architecture
● Self-contained cells of data + logic
● Each cell itself made up of a cluster of nodes
● Cells provide internal failover
● Reliability
● Scalability
Cell
Application Server Cluster
Metadata store (HBase)
Discovery Service
Client
consistent hashing by user-id
Source: Malik P., Scaling the Messages Application Back End, 04.11, Facebook Engineering's Notes
38
Summary
Scalability
● Cache
● Data sharding
● In-memory DB
● Efficient wire protocols

Flexibility
● SOA
  ● Decoupled
  ● Layered (outer, inner services)
  ● Asynchronous (firehose)
● Automation

Reliability
● Replication
● Cell architecture
39
Take Away for Application Development
● Scalability => Distribution
  ● Loosely coupled components (accessible via APIs, services)
  ● Efficiency at every level
  ● Shared nothing
● Reliability => Replication
  ● Automation
  ● Monitoring
  ● Fast provisioning of replicas
● Flexibility => Simplification
  ● Build for simple use
  ● Abstract to simplify (e.g. Pig/Hadoop, Redis/in-memory DB)
  ● API-everything
40
Paradigm Shift?
● New normal
● 100s of machines
● <5 engineers
● Distributed work load
● Horizontal scalability
● PBs of data
● Drivers
● Low barriers to entry – free or low-cost hosting
● Declining cost – CPU, storage, networking
● Web-scale ready open-source software
41
Q & A
Thank you
42
What we haven't covered
● CAP Theorem
● A/B Testing
● NoSQL Databases