Lessons from Sharding Solr at Etsy: Presented by Gregg Donovan, Etsy

October 13-16, 2016 • Austin, TX

Lessons from Sharding Solr at Etsy Gregg Donovan

@greggdonovan Senior Software Engineer, etsy.com

http://etsy.com

• 5.5 Years Solr & Lucene at Etsy.com

• 3 Years Solr & Lucene at TheLadders.com

• Speaker at LuceneRevolution 2011 & 2013

http://TheLadders.com

Jeff Dean, Challenges in Building Large-Scale Information Retrieval Systems

1.5 Million Active Shops

32 Million Items Listed

21.7 Million Active Buyers

Agenda

• Sharding Solr at Etsy V0 — No sharding

• Sharding Solr at Etsy V1 — Local sharding

• Sharding Solr at Etsy V2 (*) — Distributed sharding

• Questions

* What we’re about to launch.

Sharding V0 — Not Sharding

• Why do we shard?

• Data size grows beyond RAM on a single box

• Lucene can handle this, but there’s a performance cost

• Data size grows beyond local disk

• Latency requirements

• Not sharding allowed us to avoid many problems we’ll discuss later.

Sharding V0 — Not Sharding

• How to keep data size small enough for one host?

• Don’t store anything other than IDs

• fl=pk_id,fk_id,score

• Keep materialized objects in memcached

• Only index fields needed

• Prune index after experiments add fields

• Get more RAM

Sharding V0 — Not Sharding

• How does it fail?

• GC

• Solution

• “Banner” protocol

• Client-side load balancer

• Client connects, then waits for 4 bytes (0xC0DEA5CF) from the server within 1-10ms before sending the query. Otherwise, it tries another server.
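
The banner check can be sketched in plain Java as below. This is a minimal illustration and not Etsy's client: the class and method names are hypothetical, the banner is assumed to be sent as a big-endian 32-bit integer, and the same timeout is assumed for connecting and for waiting on the banner.

```java
import java.io.DataInputStream;
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;
import java.util.List;

/** Hypothetical client-side load balancer step: pick the first server that sends the banner in time. */
final class BannerCheck {
    static final int BANNER = 0xC0DEA5CF;

    /** Returns a connected socket whose server sent the banner promptly, or null if none did. */
    static Socket connectHealthy(List<InetSocketAddress> servers, int timeoutMs) {
        for (InetSocketAddress server : servers) {
            Socket socket = new Socket();
            try {
                socket.connect(server, timeoutMs);
                socket.setSoTimeout(timeoutMs);  // bounds each blocking read while waiting for the banner
                DataInputStream in = new DataInputStream(socket.getInputStream());
                if (in.readInt() == BANNER) {    // readInt assumes big-endian byte order
                    return socket;               // server is responsive: the caller sends the query next
                }
            } catch (IOException e) {
                // Refused, timed out, or closed early: fall through and try the next server.
            }
            try { socket.close(); } catch (IOException ignored) { }
        }
        return null;
    }
}
```

A server whose JVM is paused can still have the connection accepted by the kernel, but it cannot write the banner, so the read times out and the client moves on before committing a query to it.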

Sharding V1 — Local Sharding

• Motivations

• Better latency

• Smaller JVMs

• Tough to open a 31 GB heap dump on your laptop

• Working set still fit in RAM on one box.

• What’s the simplest system we can build?

Sharding V1 — Local Sharding

• Lucene parallelism

• Shikhar Bhushan at Etsy experimented with segment level parallelism

• See Search-time Parallelism at Lucene Revolution 2014

• Made its way into LUCENE-6294 (Generalize how IndexSearcher parallelizes collection execution). Committed in Lucene 5.1.

• Ended up with eight Solr shards per host, each in its own small JVM

• Moved query generation and re-ranking to separate process: the “mixer”

Sharding V1 — Local Sharding

• Based on Solr distributed search

• By default, Solr does two-pass distributed search

• First pass gets top IDs

• Second pass fetches stored fields for each top document

• Implemented distrib.singlePass mode (SOLR-5768)

• Does not make sense if individual documents are expensive to fetch

• Basic request tracing via HTTP headers (SOLR-5969)

Sharding V1 — Local Sharding

• Required us to fetch 1000+ results from each shard for the reranking layer

• How to efficiently fetch 1000 documents per shard?

• Use Solr’s field syntax to fetch data from FieldCache

• e.g. fl=pk_id:field(pk_id),fk_id:field(fk_id),score

• When all fields are “pseudo” fields, no need to fetch stored fields per document.
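
As a small illustration, the fl value above can be generated from a list of field names. The helper below is hypothetical and only builds the parameter string. It assumes every listed field can be read through Solr's field() function and that the score is always requested last.

```java
import java.util.List;
import java.util.stream.Collectors;

/** Hypothetical helper: build an fl parameter made only of pseudo-fields plus score. */
final class PseudoFieldList {
    static String fl(List<String> fields) {
        // Alias each field to field(name) so the value comes from the function, not from stored fields.
        String pseudoFields = fields.stream()
                .map(name -> name + ":field(" + name + ")")
                .collect(Collectors.joining(","));
        return pseudoFields + ",score";
    }

    public static void main(String[] args) {
        System.out.println("fl=" + fl(List.of("pk_id", "fk_id")));
    }
}
```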

Sharding V1 — Local Sharding

• Result

• Very large latency win

• Easy system to manage

• Well understood failure and recovery

• Avoided solving many distributed systems issues

Sharding V2 — Distributed Sharding

• Motivation

• Further latency improvements

• Prepare for data to exceed a single node’s capacity

• Significant latency improvements require finer sharding, more CPUs per request

• Requires a real distributed system and sophisticated RPC

• Before proceeding, stop what you’re doing and read everything by Google’s Jeff Dean and Twitter’s Marius Eriksen

Sharding V2 — Distributed Sharding

• New problems

• Partial failures

• Lagging shards

• Synchronizing cluster state and configuration

• Network partitions

• Jepsen

• Distributed IDF issues exacerbated

Solving Distributed IDF

• Inverse Document Frequency (IDF) now varies across shards, biasing ranking

• Calculate IDF offline in Hadoop

• IDFReplacedSimilarityFactory

• Offline data populates cache of Map<BytesRef,Float> (term -> score)

• Factory returns a Similarity that overrides TFIDFSimilarity#idfExplain

• Cache misses are given a constant IDF, as if the term appeared in very few documents

• Can be extended to solve i18n IDF issues
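
The lookup described above can be sketched without any Lucene dependency. The class below is an illustration only: it uses String keys where the slide uses BytesRef, and the names GlobalIdf and rareTermIdf are hypothetical. The classicIdf method shows the formula used by Lucene's classic TF-IDF similarity, which an offline job could apply to corpus-wide counts.

```java
import java.util.Map;

/** Sketch of a shard-independent IDF source backed by an offline-computed table. */
final class GlobalIdf {
    private final Map<String, Float> idfByTerm;  // term -> IDF computed over the whole corpus
    private final float rareTermIdf;             // constant returned on cache misses

    GlobalIdf(Map<String, Float> idfByTerm, float rareTermIdf) {
        this.idfByTerm = idfByTerm;
        this.rareTermIdf = rareTermIdf;
    }

    /** Classic Lucene TF-IDF formula: 1 + ln(numDocs / (docFreq + 1)). */
    static float classicIdf(long docFreq, long numDocs) {
        return (float) (Math.log(numDocs / (double) (docFreq + 1)) + 1.0);
    }

    /** Every shard returns the same value for a term, regardless of its local statistics. */
    float idf(String term) {
        Float cached = idfByTerm.get(term);
        return cached != null ? cached : rareTermIdf;
    }
}
```

In a real Similarity, the overridden idfExplain would return this value in place of the one derived from the shard's own term statistics.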

Sharding V2 — Distributed Sharding

• ShardHandler

• Solr’s abstraction for fanning out queries to shards

• Ships with default implementation (HttpShardHandler) based on HTTP 1.1

• Does fanout (distrib=true) and processes requests coming from other Solr nodes (distrib=false).

• Reads shards.rows and shards.start parameters

ShardHandler API

Solr’s SearchHandler calls submit for each shard and then either takeCompletedIncludingErrors or takeCompletedOrError, depending on partial results tolerance.

public abstract class ShardHandler {
  public abstract void checkDistributed(ResponseBuilder rb);
  public abstract void submit(ShardRequest sreq, String shard, ModifiableSolrParams params);
  public abstract ShardResponse takeCompletedIncludingErrors();
  public abstract ShardResponse takeCompletedOrError();
  public abstract void cancelAll();
  public abstract ShardHandlerFactory getShardHandlerFactory();
}


Distributed query requirements

• Distributed tracing

• E.g.: Google’s Dapper, Twitter’s Zipkin, Etsy’s CrossStitch

• Dapper, a Large-Scale Distributed Systems Tracing Infrastructure

• Handle node failures, slowness

Better Know Your Switches

Have a clear understanding of your networking requirements and whether your hardware meets them.

• Prefer line-rate switches

• Prefer cut-through to store-and-forward

• No buffering, just read the IP packet header and move packet to the destination

• Track and graph switch statistics in the same dashboard you display your search latency stats

• errors, retransmits, etc.

Sharding V2 — Distributed Sharding

First experiment, Twitter’s Finagle

• Built on Netty

• Mux RPC multiplexing protocol

• See Your Server as a Function by Marius Eriksen

• Built-in support for Zipkin distributed tracing

• Served as inspiration for Facebook’s futures-based RPC Wangle

• Implemented a FinagleShardHandler


Second experiment, custom Thrift-based protocol

• Blocking I/O easier to integrate with SolrJ API

• Able to integrate our own distributed tracing

• LZ4 compression via a custom Thrift TTransport


Future experiment: HTTP/2

• One TCP connection for all requests between two servers

• Libraries

• Square’s OkHttp

• Google’s gRPC

• Jetty client in 9.3+ — appears to be Solr’s choice


Implementation note

• Separated fanout from individual request processing

• SolrJ client via an EmbeddedSolrServer containing an empty RAM directory.

• Saves a network hop

• Makes shards easier to profile, tune

• Can return result to SolrJ without sending merged results over the network


• Good

• Individual shard times demonstrate very low average latency

• Bad

• Overall p95, p99 nowhere near averages

• Why? Lagging shards due to GC, filterCache misses, etc.

• More shards means more chances to hit outliers
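
The effect of fanout on tail latency can be worked out directly. The sketch below assumes shard latencies are independent, which real clusters only approximate, and asks how often at least one shard in a request lands above its own p99.

```java
/** Worked arithmetic: chance that a fanned-out request waits on at least one slow shard. */
final class TailFanout {
    /** P(at least one of n independent shards is slow) = 1 - (1 - pSlow)^n. */
    static double probAnySlow(double pSlow, int shards) {
        return 1.0 - Math.pow(1.0 - pSlow, shards);
    }

    public static void main(String[] args) {
        for (int shards : new int[] {1, 8, 32, 100}) {
            System.out.printf("%3d shards: %.1f%% of requests hit a p99 outlier%n",
                    shards, 100 * probAnySlow(0.01, shards));
        }
    }
}
```

Under that assumption, a request to 8 shards waits on a p99 outlier about 7.7% of the time, and a request to 100 shards about 63% of the time, so the overall p95 and p99 are governed by per-shard outliers and not by per-shard averages.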

Sharding V2 — Distributed Sharding

• Solutions

• See The Tail at Scale by Jeff Dean, CACM 2013.

• Eliminate all sources of inter-host variability

• No filter or other cache misses

• No GC

• Eliminate OS pauses, networking hiccups, deploys, restarts, etc.

• Not realistic

Sharding V2 — Distributed Sharding

• Backup Requests

• Methods

• Brute force — send two copies of every request to different hosts, take the fastest response

• Less crude — wait X milliseconds for the first server to respond, then send a backup request.

• Adaptive — choose X based on the first Y% of responses to return.

• Cancellation — Cancel the slow request to save CPU once you’re sure you don’t need it.
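
The "less crude" method with cancellation might look like the following sketch, written with CompletableFuture (delayedExecutor needs Java 9 or later). It is not the implementation from the talk: the class and method names are hypothetical, the two suppliers stand in for asynchronous RPC calls to two replicas, and cancel() here only marks the future as cancelled. Actually stopping work on the remote host needs support in the RPC layer.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;
import java.util.function.Supplier;

/** Sketch of a backup ("hedged") request: send a second copy only if the first is slow. */
final class BackupRequests {
    /**
     * Starts the primary request. If no result is available after delayMs, starts the backup.
     * The returned future completes with the first successful response, or fails if both fail.
     */
    static <T> CompletableFuture<T> hedged(Supplier<CompletableFuture<T>> primary,
                                           Supplier<CompletableFuture<T>> backup,
                                           long delayMs) {
        CompletableFuture<T> result = new CompletableFuture<>();
        CompletableFuture<T> first = primary.get();
        first.thenAccept(result::complete);  // a primary success wins if it arrives first

        CompletableFuture.delayedExecutor(delayMs, TimeUnit.MILLISECONDS).execute(() -> {
            if (result.isDone()) {
                return;                      // primary answered in time: no backup needed
            }
            CompletableFuture<T> second = backup.get();
            second.whenComplete((value, error) -> {
                if (error == null) {
                    result.complete(value);
                } else {
                    // Backup failed: give up only if the primary fails as well.
                    first.whenComplete((v, e) -> {
                        if (e != null) result.completeExceptionally(error);
                    });
                }
            });
            // Once a winner is known, cancel the loser so no one keeps waiting on it.
            result.whenComplete((v, e) -> {
                first.cancel(true);
                second.cancel(true);
            });
        });
        return result;
    }
}
```

An adaptive variant would compute delayMs from recently observed latencies in place of a fixed constant. Note that in this sketch a primary that fails early still waits out the delay before the backup is sent.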

Sharding V2 — Distributed Sharding

• “Good enough”

• Return results to user after X% of results return if there are enough results. Don’t issue backup requests, just cancel laggards.

• Only applicable in certain domains.

• Poses questions:

• Should you cache partial results?

• How is paging affected?
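
One way to read the "good enough" policy is sketched below with an ExecutorCompletionService. The names and thresholds are assumptions made for illustration: results are plain document IDs, X is interpreted as a fraction of shards that must respond, and a failed shard counts as having responded with no results.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.CompletionService;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorCompletionService;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Future;

/** Sketch: stop waiting once enough shards have answered with enough results. */
final class GoodEnoughFanout {
    static List<Integer> search(ExecutorService pool,
                                List<Callable<List<Integer>>> shardQueries,
                                double minShardFraction,
                                int minResults) throws InterruptedException {
        CompletionService<List<Integer>> completed = new ExecutorCompletionService<>(pool);
        List<Future<List<Integer>>> inFlight = new ArrayList<>();
        for (Callable<List<Integer>> query : shardQueries) {
            inFlight.add(completed.submit(query));
        }

        int needed = (int) Math.ceil(minShardFraction * shardQueries.size());
        int responded = 0;
        List<Integer> merged = new ArrayList<>();
        while (responded < shardQueries.size()) {
            try {
                merged.addAll(completed.take().get());  // next shard to finish, in completion order
            } catch (ExecutionException e) {
                // A failed shard contributes nothing but still counts as a response.
            }
            responded++;
            if (responded >= needed && merged.size() >= minResults) {
                break;                                   // good enough: stop waiting for laggards
            }
        }
        for (Future<List<Integer>> shard : inFlight) {
            shard.cancel(true);                          // interrupts laggards; no-op for finished shards
        }
        return merged;
    }
}
```

Because the set of shards that answered can differ between requests, the same query may return different results, which is why caching and paging become open questions under this policy.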

Resilience Testing

Now you own a distributed system. How do you know it works?

• “The Troublemaker”

• Inspired by Netflix’s Chaos Monkey

• Authored by Etsy’s Toria Gibbs

• Make sure humans can operate it

• Failure simulation — don’t wait until 3am

• Gameday exercises and Runbooks

Bonus material!

Better Know Your Kernel

A lesson not about sharding learned while sharding…

• Linux’s futex_wait() was broken in CentOS 6.6

• Backported patches needed from Linux 3.18

• Future direction: make kernel updates independent from distribution updates

• E.g. plenty of good stuff (networking improvements, kernel introspection [see @brendangregg]) between 3.10 and 4.2+, but it won’t come to CentOS for years

• Updating kernel alone easier to roll out

What else are we working on?

• Mesos for cluster orchestration

• GPUs for massive increases in per query computational capacity

Thanks for coming.

@greggdonovan

Questions?
