Building a Modern Website for Scale (QCon NY 2013)


DESCRIPTION

My talk on how LinkedIn scales for high traffic and high availability, given at QCon NY 2013.

TRANSCRIPT

  • Building a Modern Website for Scale. Sid Anand, QCon NY 2013. (Recruiting Solutions)
  • About Me. Current life: LinkedIn, Search, Network, and Analytics (SNA), Search Infrastructure. In a previous life: LinkedIn, Data Infrastructure, Architect; Netflix, Cloud Database Architect; eBay, Web Development, Research Lab, & Search Engine. And many years prior: studying Distributed Systems at Cornell University. @r39132
  • Our mission: Connect the world's professionals to make them more productive and successful. @r39132
  • Over 200M members and counting. Chart: LinkedIn members (millions) by year, 2004-2012: 2, 4, 8, 17, 32, 55, 90, 145, 200+. The world's largest professional network, growing at more than 2 members/sec. Source: http://press.linkedin.com/about
  • >88% of Fortune 100 companies use LinkedIn Talent Solutions to hire. >2.9M Company Pages. >5.7B professional searches in 2012. 19 languages. >30M students and NCGs, the fastest growing demographic. The world's largest professional network; over 64% of members are now international. Source: http://press.linkedin.com/about @r39132
  • Other company facts: Headquartered in Mountain View, Calif., with offices around the world! As of June 1, 2013, LinkedIn has ~3,700 full-time employees located around the world. Source: http://press.linkedin.com/about @r39132
  • Agenda: Company Overview; Serving Architecture; How Does LinkedIn Scale (Web Services, Databases, Messaging, Other); Q & A. @r39132
  • Serving Architecture @r39132
  • LinkedIn: Serving Architecture. Overview: Our site runs primarily on Java, with some use of Scala for specific infrastructure. The presentation tier is an exception: it runs on everything! What runs on Scala? The Network Graph Engine, Kafka, and some front ends (Play). Most of our services run on Jetty. @r39132
  • LinkedIn: Serving Architecture. Presentation tier (Frontier): Play, Spring MVC, NodeJS, JRuby, Grails, Django, USSR (Chrome V8 JS engine). Our presentation tier is composed of ATS with 2 plugins: Fizzy, a content aggregator that unifies content across a diverse set of front-ends, built on an open-source JS templating framework; and USSR (a.k.a. Unified Server-Side Rendering), which packages Google Chrome's V8 JS engine as an ATS plugin. @r39132
  • LinkedIn: Serving Architecture. Diagram: Presentation Tier, Business Service Tier, Data Service Tier, and Data Infrastructure (Oracle master/slaves, Memcached, Hadoop, other); a web page requests information A and B. Presentation Tier: a thin layer focused on building the UI; it assembles the page by making parallel requests to BST services. Business Service Tier: encapsulates business logic; can call other BST clusters and its own DST cluster. Data Service Tier: encapsulates DAL logic; each cluster is concerned with one Oracle schema. Data Infrastructure: concerned with the persistent storage of and easy access to data. @r39132
  • Serving Architecture: Other. Diagram: Oracle or Espresso emits data change events to the Search Index, Graph Index, Read Replicas, and Standardization. As I will discuss later, data that is committed to databases also needs to be made available to a host of other online serving systems: Search; Standardization services (these provide canonical names for your titles, companies, schools, skills, fields of study, etc.); the Graph engine; Recommender systems. This data change feed needs to be scalable, reliable, and fast. [Databus] @r39132
  • Serving Architecture: Hadoop. How do we use Hadoop to serve? Hadoop is central to our analytic infrastructure. We ship data streams into Hadoop from our primary databases via Databus and from applications via Kafka. Hadoop jobs take daily or hourly dumps of this data and compute data files that Voldemort can load. Voldemort loads these files and serves them on the site. @r39132
  • Voldemort: RO store usage at LinkedIn: People You May Know, LinkedIn Skills, Related Searches, Viewers of this profile also viewed, Events you may be interested in, Jobs you may be interested in. (A client read sketch follows.) @r39132
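For context, reading one of these Hadoop-built read-only stores from the serving side looks roughly like the sketch below. It follows the client API shown in the open-source Voldemort documentation; the bootstrap URL, store name, and key are placeholders, not LinkedIn's actual store definitions.

```java
import voldemort.client.ClientConfig;
import voldemort.client.SocketStoreClientFactory;
import voldemort.client.StoreClient;
import voldemort.client.StoreClientFactory;
import voldemort.versioning.Versioned;

public class VoldemortReadSketch {
    public static void main(String[] args) {
        // Bootstrap against any node in the cluster; the factory fetches cluster metadata from it.
        StoreClientFactory factory = new SocketStoreClientFactory(
                new ClientConfig().setBootstrapUrls("tcp://voldemort-host:6666"));

        // Hypothetical read-only store that a Hadoop job built and pushed.
        StoreClient<String, String> client = factory.getStoreClient("people-you-may-know");

        // Read a value that was bulk-loaded from the Hadoop-computed data files.
        Versioned<String> value = client.get("member:12345");
        if (value != null) {
            System.out.println("recommendations = " + value.getValue());
        }
    }
}
```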
  • How Does LinkedIn Scale? @r39132
  • Scaling Web Services. LinkedIn: Web Services. @r39132
  • LinkedIn: Scaling Web Services. Problem: How do 150+ web services communicate with each other to fulfill user requests in the most efficient and fault-tolerant manner? How do they handle slow downstream dependencies? For illustration's sake, consider the following scenario: Service B has 2 hosts, Service C has 2 hosts, and a machine in Service B sends a web request to a machine in Service C. @r39132
  • LinkedIn: Scaling Web Services. What sorts of failure modes are we concerned about? A machine in Service C has a long GC pause, or calls a service that has a long GC pause, or calls a service that calls a service that has a long GC pause... see where I am going? A machine in Service C or in its downstream dependencies may be slow for any reason, not just GC (e.g. bottlenecks on CPU, IO, or memory, or lock contention). Goal: Given all of this, how can we ensure high uptime? Hint: Pick the right architecture and implement best practices on top of it! @r39132
  • LinkedIn: Scaling Web Services. In the early days, LinkedIn made a big bet on Spring and Spring RPC. Issues: 1. Spring RPC is difficult to debug. You cannot call the service using simple command-line tools like curl, and since the RPC call is implemented as a binary payload over HTTP, HTTP access logs are not very useful. 2. A Spring RPC-based architecture leads to high MTTR. Spring RPC is not flexible and pluggable, so we cannot use custom client-side load balancing strategies or custom fault-tolerance features. Instead, all we can do is put all of our service nodes behind a hardware load balancer and pray! If a Service C node experiences a slowness issue, a NOC engineer needs to be alerted and then manually remove it from the LB (MTTR > 30 minutes). @r39132
  • LinkedIn: Scaling Web Services. Solution: A better approach is one that we see often in both cloud-based architectures and NoSQL systems: dynamic discovery plus client-side load balancing, as sketched below. Step 1: Service C nodes announce their availability to serve traffic to a ZooKeeper (ZK) registry. Step 2: Service B nodes get updates from ZK. Step 3: Service B nodes route traffic to Service C nodes. @r39132
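A minimal sketch of that three-step pattern using the plain Apache ZooKeeper client is below. The znode paths, session timeout, and random node selection are assumptions for illustration; LinkedIn's production discovery and routing layer is more involved.

```java
import java.util.List;
import java.util.concurrent.ThreadLocalRandom;
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class DynamicDiscoverySketch {
    private final ZooKeeper zk;

    public DynamicDiscoverySketch(String zkConnect) throws Exception {
        // Watcher is a no-op here; a real client would react to session and child-change events.
        this.zk = new ZooKeeper(zkConnect, 30_000, event -> { });
    }

    // Step 1: a Service C node announces itself with an ephemeral znode.
    // The znode disappears automatically if the node dies or loses its ZK session.
    public void announce(String hostPort) throws Exception {
        zk.create("/services/serviceC/" + hostPort, new byte[0],
                ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);
    }

    // Step 2: a Service B node reads (and watches) the current members of Service C.
    public List<String> discover() throws Exception {
        return zk.getChildren("/services/serviceC", true /* re-arm the watch on change */);
    }

    // Step 3: client-side load balancing. Random selection here; real strategies can be smarter.
    public String pickNode(List<String> liveNodes) {
        return liveNodes.get(ThreadLocalRandom.current().nextInt(liveNodes.size()));
    }
}
```

Because the registration znodes are ephemeral, a node that stops responding simply drops out of the child list, so no NOC engineer has to pull it out of a hardware load balancer by hand.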
  • LinkedIn: Scaling Web Services. With this new paradigm for discovering services and routing requests to them, we can incorporate additional fault-tolerance support. @r39132
  • LinkedIn: Scaling Web Services. Best Practices, Fault-tolerance Support. 1. No client should wait indefinitely for a response from a service. Issues: Waiting causes a traffic jam; all upstream clients end up also getting blocked. Each service has a fixed number of Jetty or Tomcat threads, and once those are all tied up waiting, no new requests can be handled. Solution: After a configurable timeout, return rather than continuing to wait. Store different SLAs in ZK for each REST endpoint; in other words, all calls are not the same and should not have the same read timeout. (A sketch follows.) @r39132
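A sketch of the per-endpoint timeout idea, assuming hypothetical endpoint names and SLA values hard-coded in a map; in the setup described above the SLAs would come from ZooKeeper instead.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.net.http.HttpTimeoutException;
import java.time.Duration;
import java.util.Map;

public class PerEndpointTimeouts {
    // Illustrative SLAs; in production these would be read (and refreshed) from ZooKeeper.
    private static final Map<String, Duration> SLA = Map.of(
            "/profile",     Duration.ofMillis(50),
            "/full-search", Duration.ofMillis(500));

    private final HttpClient client = HttpClient.newHttpClient();

    public String call(String host, String endpoint) {
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("http://" + host + endpoint))
                .timeout(SLA.getOrDefault(endpoint, Duration.ofMillis(100))) // never wait indefinitely
                .GET()
                .build();
        try {
            return client.send(request, HttpResponse.BodyHandlers.ofString()).body();
        } catch (HttpTimeoutException slowDownstream) {
            return "";   // degrade gracefully instead of tying up a Jetty/Tomcat thread
        } catch (Exception otherFailure) {
            return "";
        }
    }
}
```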
  • LinkedIn: Scaling Web Services. Best Practices, Fault-tolerance Support. 2. Isolate calls to back-ends from one another. Issues: You depend on responses from independent services A and B; if A slows down, will you still be able to serve B? Details: This is a common use case for federated services and for shard aggregators. E.g. Search at LinkedIn is federated and will call people-search, job-search, group-search, etc. in parallel. E.g. People-search is itself sharded, so an additional shard-aggregation step needs to happen across 100s of shards. Solution: Use async requests, or independent ExecutorServices for sync requests (one per shard or vertical), as in the sketch below. @r39132
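A sketch of that isolation idea, assuming two hypothetical verticals (people-search and job-search): each gets its own ExecutorService and its own time budget, so a slow vertical cannot exhaust the threads used for the other.

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class IsolatedBackendCalls {
    // One pool per downstream dependency: a slow dependency only exhausts its own pool.
    private final ExecutorService peopleSearchPool = Executors.newFixedThreadPool(16);
    private final ExecutorService jobSearchPool    = Executors.newFixedThreadPool(16);

    public String federatedSearch(String query) throws InterruptedException {
        Future<String> people = peopleSearchPool.submit(() -> callPeopleSearch(query));
        Future<String> jobs   = jobSearchPool.submit(() -> callJobSearch(query));

        // Each vertical gets its own budget; a timeout on one does not block the other.
        String peopleResults = getOrEmpty(people, 150);
        String jobResults    = getOrEmpty(jobs, 150);
        return peopleResults + "\n" + jobResults;
    }

    private String getOrEmpty(Future<String> future, long timeoutMs) throws InterruptedException {
        try {
            return future.get(timeoutMs, TimeUnit.MILLISECONDS);
        } catch (TimeoutException | ExecutionException failedOrSlow) {
            future.cancel(true);   // don't keep doing work for a result we no longer need
            return "";
        }
    }

    // Placeholders for the real downstream calls.
    private String callPeopleSearch(String q) { return "people results for " + q; }
    private String callJobSearch(String q)    { return "job results for " + q; }
}
```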
  • LinkedIn: Scaling Web Services. Best Practices, Fault-tolerance Support. 3. Cancel unnecessary work. Issues: Work issued down the call-graph is unnecessary if the clients at the top of the call graph have already timed out. Imagine that as a call reaches half-way down your call-tree, the caller at the root times out: you will still issue work down the remaining half-depth of your tree unless you cancel it! Solution: A possible approach: the root of the call-tree adds (tree-UUID, inProgress status) to Memcached; all services pass the tree-UUID down the call-tree (e.g. as a custom HTTP request header); servlet filters at each hop check whether inProgress == false, and if so, immediately respond with an empty response. (Sketched below.) @r39132
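A sketch of that per-hop check written as a servlet filter. The header name and the in-memory map standing in for Memcached are assumptions for illustration; the slide's actual approach uses a shared Memcached cluster keyed by the call-tree UUID.

```java
import java.io.IOException;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import javax.servlet.Filter;
import javax.servlet.FilterChain;
import javax.servlet.FilterConfig;
import javax.servlet.ServletException;
import javax.servlet.ServletRequest;
import javax.servlet.ServletResponse;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

public class CancelledCallTreeFilter implements Filter {
    // Stand-in for Memcached: tree-UUID -> inProgress flag, written by the root of the call-tree.
    static final ConcurrentMap<String, Boolean> CALL_TREE_STATUS = new ConcurrentHashMap<>();

    @Override
    public void doFilter(ServletRequest req, ServletResponse res, FilterChain chain)
            throws IOException, ServletException {
        // Hypothetical custom header carrying the call-tree UUID down every hop.
        String treeId = ((HttpServletRequest) req).getHeader("X-CallTree-UUID");

        if (treeId != null && Boolean.FALSE.equals(CALL_TREE_STATUS.get(treeId))) {
            // The root has already timed out: skip the work and return an empty response.
            ((HttpServletResponse) res).setStatus(HttpServletResponse.SC_NO_CONTENT);
            return;
        }
        chain.doFilter(req, res);
    }

    @Override public void init(FilterConfig cfg) { }
    @Override public void destroy() { }
}
```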
  • LinkedIn: Scaling Web Services. Best Practices, Fault-tolerance Support. 4. Avoid sending requests to hosts that are GCing. Issues: If a client sends a web request to a host in Service C and that host is experiencing a GC pause, the client will wait 50-200 ms, depending on the read timeout for the request. During that GC pause, other requests will also be sent to that node before they all eventually time out. Solution: Send a GC scout request before every real web request. @r39132
  • LinkedIn: Scaling Web Services. Why is this a good idea? Scout requests are cheap and add negligible overhead to requests. Step 1: A Service B node sends a cheap 1 msec TCP request to a dedicated scout Netty port. Step 2: If the scout request comes back within 1 msec, send the real request to the Tomcat or Jetty port. Step 3: Else, repeat with a different host in Service C. (A sketch follows.) @r39132
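A sketch of the client-side scout check, assuming a made-up host, scout port, and one-byte ping payload; the real scout endpoint is a dedicated Netty handler next to the Jetty/Tomcat port.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.InetSocketAddress;
import java.net.Socket;

public class GcScoutCheck {
    /**
     * Sends a tiny request to the host's scout port and waits briefly for a reply.
     * The reply has to come from application code (e.g. a Netty handler), so if the JVM is in a
     * stop-the-world GC pause, the read times out and the caller picks another Service C node.
     */
    public static boolean scoutOk(String host, int scoutPort, int budgetMillis) {
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, scoutPort), budgetMillis);
            socket.setSoTimeout(budgetMillis);
            OutputStream out = socket.getOutputStream();
            out.write('p');                 // minimal "ping" byte
            out.flush();
            InputStream in = socket.getInputStream();
            return in.read() != -1;         // any echo within the budget counts as healthy
        } catch (IOException timedOutOrUnreachable) {
            return false;
        }
    }

    public static void main(String[] args) {
        String host = "service-c-01.example.com";   // hypothetical Service C node
        if (scoutOk(host, 7001, 1)) {
            System.out.println("send the real request to " + host);
        } else {
            System.out.println("pick another Service C node");
        }
    }
}
```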
  • LinkedIn: Scaling Web Services. Best Practices, Fault-tolerance Support. 5. Services should protect themselves from traffic bursts. Issues: Service nodes should protect themselves from being overwhelmed by requests; this also protects their downstream servers from being overwhelmed. Simply setting the Tomcat or Jetty thread pool size is not always an option, since often these are not configurable per application. Solution: Use a sliding window counter. If the counter exceeds a configured threshold, return immediately with a 503 (Service Unavailable). Set the threshold below the Tomcat or Jetty thread pool size. (Sketched below.) @r39132
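A sketch of the burst-protection counter, simplified to a fixed one-second window rather than a true sliding window; the threshold value is illustrative and, per the slide, should sit below the container's thread pool size.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.atomic.AtomicLong;

public class RequestThrottle {
    private final int maxRequestsPerWindow;
    private final AtomicLong windowStartMillis = new AtomicLong(System.currentTimeMillis());
    private final AtomicInteger countInWindow = new AtomicInteger(0);

    public RequestThrottle(int maxRequestsPerWindow) {
        this.maxRequestsPerWindow = maxRequestsPerWindow;
    }

    /** Returns true if the request may proceed; false means reply immediately with 503. */
    public boolean tryAcquire() {
        long now = System.currentTimeMillis();
        long start = windowStartMillis.get();
        if (now - start >= 1000 && windowStartMillis.compareAndSet(start, now)) {
            countInWindow.set(0);   // roll over to a new one-second window
        }
        return countInWindow.incrementAndGet() <= maxRequestsPerWindow;
    }
}
```

A servlet filter at the front of the service would call tryAcquire() before doing any work and send 503 (Service Unavailable) when it returns false, shedding load before the Jetty or Tomcat thread pool fills up.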
  • Espresso: Scaling Databases. LinkedIn: Databases. @r39132
  • Espresso: Overview. Problem: What do we do when we run out of QPS capacity on an Oracle database server? You can only buy yourself out of this problem so far (i.e. buy a bigger box). Read replicas and memcached will help scale reads, but not writes! Solution: Espresso. You need a horizontally-scalable database! Espresso is LinkedIn's newest NoSQL store. It offers the following features: horizontal scalability; works on commodity hardware; document-centric (Avro documents supporting rich nested data models, with drama-free schema evolution); extensions for Lucene indexing; supports transactions (within a partition, e.g. memberId); supports conditional reads & writes using standard HTTP headers (e.g. If-Modified-Since). @r39132
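As an illustration of the conditional-read idea, the sketch below issues an HTTP GET with an If-Modified-Since header and treats a 304 as "serve the cached copy". The hostname, URL layout (database/table/key), and date are invented; Espresso's actual REST interface is internal to LinkedIn.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class ConditionalReadSketch {
    public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newHttpClient();

        // Hypothetical document URI: database "Profiles", table "Member", key 12345.
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("http://espresso-router.example.com/Profiles/Member/12345"))
                .header("If-Modified-Since", "Tue, 11 Jun 2013 00:00:00 GMT")
                .GET()
                .build();

        HttpResponse<String> response = client.send(request, HttpResponse.BodyHandlers.ofString());
        if (response.statusCode() == 304) {
            System.out.println("Not modified: serve the locally cached copy");
        } else {
            System.out.println("Fresh document: " + response.body());
        }
    }
}
```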
  • Espresso: Overview. Why not use open source? We needed a change capture stream (e.g. Databus), backup-restore, and a mature storage engine (InnoDB). @r39132
  • Espresso: Architecture. Components: Request Routing Tier (consults the Cluster Manager to discover the node to route to, then forwards the request to the appropriate storage node); Storage Tier (data store (MySQL) plus a local secondary index (Lucene)); Cluster Manager (responsible for data set partitioning; manages storage nodes); Relay Tier (replicates data to consumers). @r39132
  • Databus: Scaling Databases. LinkedIn: Database Streams. @r39132
  • Databus: Overview. Problem: Our databases (Oracle & Espresso) are used for R/W web-site traffic. However, various services (Search, Graph DB, Standardization, etc.) need the ability to read the data as it is changed in these OLTP stores and, occasionally, to scan the contents in order to rebuild their entire state. Solution: Databus. Databus provides a consistent, in-time-order stream of database changes that scales horizontally and protects the source database from high read load. @r39132
  • Where does LinkedIn use Databus? @r39132
  • Databus: Usage @ LinkedIn. Diagram: Oracle or Espresso emits data change events to the Search Index, Graph Index, Read Replicas, and Standardization. A user updates the company, title, & school on his profile; he also accepts a connection. The write is made to an Oracle or Espresso master and Databus replicates it: the profile change is applied to the Standardization service (e.g. the many forms of IBM are canonicalized for search-friendliness and recommendation-friendliness); the profile change is applied to the Search Index service (recruiters can find you immediately by new keywords); the connection change is applied to the Graph Index service (the user can start receiving feed updates from his new connections immediately). @r39132
  • Databus: Architecture. Diagram: the DB feeds the Relay, which captures changes into an event window; the Bootstrap service picks up the Relay's online changes. Databus consists of 2 services. Relay Service: sharded; maintains an in-memory buffer per shard; each shard polls Oracle and then deserializes transactions into Avro. Bootstrap Service: picks up online changes as they appear in the Relay; supports 2 types of operations from clients: if a client falls behind and needs records older than what the relay has, Bootstrap can send consolidated deltas; if a new client comes online or an existing client fell too far behind, Bootstrap can send a consistent snapshot. @r39132
  • Databus: Architecture. Diagram: the Relay and Bootstrap feed consumers 1..n through the Databus client library; Bootstrap can deliver a consistent snapshot at time U followed by the online changes. Guarantees: transactions with in-commit-order delivery (commits are replicated in order); durability (you can replay the change stream at any time in the future); reliability (0% data loss); low latency (if your consumers can keep up with the relay, sub-second response time). @r39132
  • Databus: Architecture. Cool feature: server-side (i.e. relay-side & bootstrap-side) filters. Problem: Say that your consuming service is sharded 100 ways, e.g. member search indexes sharded by member_id % 100 (index_0, index_1, ..., index_99), but you have a single member Databus stream. How do you avoid having every shard read data it is not interested in? Solution: Easy, Databus already understands the notion of server-side filters; it will only send your consumer instance updates for the shard it is interested in. (A conceptual sketch follows.) @r39132
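The sketch below is a simplified, hypothetical rendering of the mod-partition idea and is not the open-source Databus client API: the consumer declares the shard it owns, and the relay applies the predicate before anything goes over the wire.

```java
import java.util.function.Predicate;

public class ServerSideFilterSketch {

    /** Minimal stand-in for a change event carrying the member key. */
    record ChangeEvent(long memberId, String payload) { }

    /** Mod-partition filter: shard i only wants events where member_id % numShards == i. */
    static Predicate<ChangeEvent> modPartitionFilter(int shardId, int numShards) {
        return event -> (event.memberId() % numShards) == shardId;
    }

    public static void main(String[] args) {
        // Relay-side: events for shard 7 of 100 pass; everything else is dropped before transmission.
        Predicate<ChangeEvent> filterForShard7 = modPartitionFilter(7, 100);

        ChangeEvent wanted   = new ChangeEvent(107L, "profile update");   // 107 % 100 == 7
        ChangeEvent unwanted = new ChangeEvent(205L, "profile update");   // 205 % 100 == 5

        System.out.println(filterForShard7.test(wanted));    // true  -> sent to the consumer
        System.out.println(filterForShard7.test(unwanted));  // false -> never leaves the relay
    }
}
```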
  • Kafka: Scaling Messaging. LinkedIn: Messaging. @r39132
  • Kafka: Overview. Problem: We have Databus to stream changes that were committed to a database. How do we capture and stream high-volume data if we relax the requirement that the data needs long-term durability? In other words, the data can have limited retention. Challenges: needs to handle a large volume of events; needs to be highly available, scalable, and low-latency; needs to provide limited durability guarantees (e.g. data retained for a week). Solution: Kafka. Kafka is a messaging system that supports topics. Consumers can subscribe to topics and read all data within the retention window. Consumers are then notified of new messages as they appear! (A sketch follows.) @r39132
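A sketch of the topic-based produce/consume model using the current Apache Kafka Java clients, which postdate the 0.8-era client discussed in this talk; the broker address, topic name, and group id are placeholders.

```java
import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class KafkaSketch {
    public static void main(String[] args) {
        // Producer side: the web tier pushes activity events to a topic.
        Properties producerProps = new Properties();
        producerProps.put("bootstrap.servers", "broker:9092");
        producerProps.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        producerProps.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        try (KafkaProducer<String, String> producer = new KafkaProducer<>(producerProps)) {
            producer.send(new ProducerRecord<>("page-view-events", "member-12345", "viewed /jobs"));
        }

        // Consumer side: a downstream system subscribes and reads within the retention window.
        Properties consumerProps = new Properties();
        consumerProps.put("bootstrap.servers", "broker:9092");
        consumerProps.put("group.id", "activity-etl");
        consumerProps.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        consumerProps.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(consumerProps)) {
            consumer.subscribe(List.of("page-view-events"));
            ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
            for (ConsumerRecord<String, String> record : records) {
                System.out.println(record.key() + " -> " + record.value());
            }
        }
    }
}
```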
  • Kafka: Usage @ LinkedIn. Kafka is used at LinkedIn for a variety of business-critical needs. Examples: End-user activity tracking (a.k.a. web tracking): emails opened, logins, pages seen, executed searches, social gestures (likes, sharing, comments). Data center operational metrics: network & system metrics such as TCP metrics (connection resets, message resends, etc.) and system metrics (IOPS, CPU, load average, etc.). @r39132
  • Kafka: Architecture. Diagram: the web tier pushes events to the broker tier (topics 1..N, sequential writes, sendfile) at ~100 MB/sec; consumers pull events at ~200 MB/sec through the Kafka client library (one iterator per topic); ZooKeeper handles message-id management and topic/partition ownership. Features: pub/sub, batch send/receive, end-to-end compression, system decoupling. Guarantees: at-least-once delivery, very high throughput, low latency (0.8), durability (for a time period), horizontally scalable. Scale at LinkedIn (average unique messages at peak): writes/sec: 460k; reads/sec: 2.3M; # topics: 693; 28 billion unique messages written per day. @r39132
  • Kafka: Overview. Improvements in 0.8, low-latency features. Kafka has always been designed for high throughput, but E2E latency could have been as high as 30 seconds. Feature 1: Long polling. For high-throughput requests, a consumer's request for data will always be fulfilled. For low-throughput requests, a consumer's request will likely return 0 bytes, causing the consumer to back off and wait. What happens if data arrives on the broker in the meantime? As of 0.8, a consumer can park a request on the broker until as many as m milliseconds have passed; if data arrives during this period, it is instantly returned to the consumer. (See the config sketch below.) @r39132
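In today's Java consumer, this long-poll behavior is exposed through configuration: fetch.min.bytes and fetch.max.wait.ms govern how long the broker may park a fetch request waiting for data. The broker address, group id, and values below are illustrative, and the property names are those of the current client rather than the 0.8-era one.

```java
import java.util.Properties;

public class LongPollConsumerConfig {
    /** Consumer properties that make the broker hold ("park") fetches until data arrives. */
    public static Properties longPollProps() {
        Properties props = new Properties();
        props.put("bootstrap.servers", "broker:9092");
        props.put("group.id", "low-traffic-consumer");
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        // Don't answer the fetch until at least 1 byte is available...
        props.put("fetch.min.bytes", "1");
        // ...but give up and return an empty response after at most 500 ms of waiting.
        // Data arriving inside that window is returned to the consumer immediately.
        props.put("fetch.max.wait.ms", "500");
        return props;
    }
}
```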
  • Kafka: Overview. Improvements in 0.8, low-latency features. In the past, data was not visible to a consumer until it was flushed to disk on the broker. Feature 2: New commit protocol. In 0.8, replicas and a new commit protocol have been introduced. As long as data has been replicated to the memory of all replicas, even if it has not been flushed to disk on any one of them, it is considered committed and becomes visible to consumers. @r39132
  • Acknowledgments: Jay Kreps (Kafka), Neha Narkhede (Kafka), Kishore Gopalakrishna (Helix), Bob Shulman (Espresso), Cuong Tran (Performance & Scalability), Diego "Mono" Buthay (Search Infrastructure). @r39132
  • Questions? @r39132