Elasticsearch in production
DESCRIPTION
Video available at http://www.youtube.com/watch?v=gkdfNl0WL-A
Original slides at http://presentations.found.no/berlin-buzzwords-2013/

This talk covers some of the lessons we've learned from securing and herding hundreds of Elasticsearch clusters. It is applicable whether you operate Elasticsearch in your own infrastructure, in the cloud, or if you're a developer who wants a better understanding of Elasticsearch's various failure modes.

Elasticsearch easily lets you develop amazing things, and it has gone to great lengths to make Lucene's features readily available in a distributed setting. However, when it comes to running Elasticsearch in production, you still have a fairly complicated system on your hands: a system with high expectations on network stability, a huge appetite for memory, and a system that assumes all users are trustworthy.

Instead of delving deeply into a few specifics, we give a brief overview of problems you are likely to run into and suggested solutions to these problems. We cover topics that are applicable to both developers and users with Elasticsearch clusters of every shape and size – with an emphasis on resiliency and security. Basic familiarity with Elasticsearch is assumed.

TRANSCRIPT
Elasticsearch in production
Alex Brasetvik
@alexbrasetvik
How marketing thinks our users feel
How we developers sometimes feel
Who?
Co-founder of Found AS
7+ years of search, 2+ Elasticsearch
We manage hundreds of Elasticsearch clusters
… on Amazon's cloud
Agenda
Memory (and stability)
Security (and multi-tenancy)
Networking (and reliability)
Client (and resiliency)
Memory
Search engines crave memory
Caches, caches, caches
Field and filter caches
Page cache
Index building
PostgreSQL
Verifies resource usage
Safe >>> fast
Uses disk if necessary
Elasticsearch trusts you
Built for speed
It'll jump if you ask it to
What could possibly go wrong?
OutOfMemoryError
Woah there
I ate all the memories
Your cluster may or may not work any more
May or may not work?
What else was happening at the time?
Corrupt cluster state, crashed Netty, …
In short: Don't end up there
Warning signs?
Monitor cache sizes and heap space
Outgrowing page cache: gradual slowdown
Outgrowing heap space: sudden crash
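Since outgrowing the heap ends in a sudden crash rather than a slowdown, it helps to alert well before the limit. The following is a minimal sketch, not a monitoring solution: it takes an already-fetched node stats document and flags nodes whose heap usage is above a threshold. The response shape (`nodes.<id>.jvm.mem.heap_used_in_bytes` and `heap_max_in_bytes`) is an assumption based on recent versions of the node stats API and should be checked against your version; the function name and the 75% default are hypothetical.

```python
def nodes_over_heap_threshold(stats, threshold=0.75):
    """Return (node_name, heap_fraction) pairs at or above the threshold, highest first.

    `stats` is assumed to look like a node stats response:
    {"nodes": {"<id>": {"name": ..., "jvm": {"mem": {...}}}}}
    """
    flagged = []
    for node_id, node in stats.get("nodes", {}).items():
        mem = node.get("jvm", {}).get("mem", {})
        used = mem.get("heap_used_in_bytes")
        limit = mem.get("heap_max_in_bytes")
        if not used or not limit:
            # Field missing or zero: skip rather than guess.
            continue
        fraction = used / limit
        if fraction >= threshold:
            flagged.append((node.get("name", node_id), fraction))
    return sorted(flagged, key=lambda pair: pair[1], reverse=True)


# Hand-written sample data, not real cluster output.
sample_stats = {
    "nodes": {
        "a": {"name": "node-1", "jvm": {"mem": {"heap_used_in_bytes": 900, "heap_max_in_bytes": 1000}}},
        "b": {"name": "node-2", "jvm": {"mem": {"heap_used_in_bytes": 300, "heap_max_in_bytes": 1000}}},
    }
}
print(nodes_over_heap_threshold(sample_stats))
```

Page cache pressure needs a different signal (for example OS-level I/O metrics), since it shows up as a gradual slowdown instead.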
Understand the memory profile
Test realistically
Bound cache sizes and flush thresholds
v0.90+ takes you longer with field filters, etc.
Large heaps are expensive to garbage collect
Keep heap < 32 GiB (But test!)
Lots of page cache is good, though!
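The two constraints above (heap below 32 GiB, plenty of memory left for the page cache) can be turned into a small sizing helper. This is a rough sketch under stated assumptions: the 50% heap share is a commonly quoted starting point and not something the slides prescribe, 31 GiB is just one way to stay safely below 32 GiB, and the function name is hypothetical. As the slide says: test.

```python
GIB = 1024 ** 3


def suggest_heap_bytes(total_ram_bytes, heap_share=0.5, ceiling_bytes=31 * GIB):
    """Suggest a JVM heap size, leaving the remainder of RAM to the page cache.

    heap_share and ceiling_bytes are assumptions to tune, not fixed rules.
    """
    if total_ram_bytes <= 0:
        raise ValueError("total_ram_bytes must be positive")
    return min(int(total_ram_bytes * heap_share), ceiling_bytes)


for ram_gib in (16, 64, 128):
    heap_gib = suggest_heap_bytes(ram_gib * GIB) / GIB
    print(f"{ram_gib} GiB RAM: heap {heap_gib:g} GiB, page cache {ram_gib - heap_gib:g} GiB")
```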
Security
Elasticsearch trusts everyone
Not its job to do auth(z)
You're the gatekeeper
_search
Read only?
Limit indexes / wrap with filters?
Protect the field caches
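One way to play gatekeeper in front of `_search` is to rebuild the request in your own layer instead of forwarding it as-is. The sketch below is illustrative only: it allows a small set of top-level keys (so sorting, facets, and script fields, which can load field caches or run code, are rejected) and wraps the user's query in a filter that the user cannot remove. It assumes a `bool` query with a `filter` clause, which exists in later Elasticsearch versions; the 0.90-era equivalent was the `filtered` query. The names `restrict_search`, `tenant_id`, and the allow-list are hypothetical, and the user's query itself is not inspected here.

```python
import copy

# Assumption: only these top-level keys are safe to pass through.
ALLOWED_TOP_LEVEL_KEYS = {"query", "from", "size"}


def restrict_search(user_body, tenant_id, tenant_field="tenant_id"):
    """Return a new search body that is always scoped to one tenant."""
    unknown = set(user_body) - ALLOWED_TOP_LEVEL_KEYS
    if unknown:
        raise ValueError(f"disallowed search keys: {sorted(unknown)}")
    restricted = copy.deepcopy(user_body)  # never mutate the caller's dict
    user_query = restricted.pop("query", {"match_all": {}})
    restricted["query"] = {
        "bool": {
            "must": [user_query],
            # The filter is added server-side, outside the user's control.
            "filter": [{"term": {tenant_field: tenant_id}}],
        }
    }
    return restricted


request = {"query": {"match": {"title": "wingsuit"}}, "size": 10}
print(restrict_search(request, "acme"))
```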
Arbitrary code execution
Elasticsearch has powerful scripting
Not sandboxed
On by default
Any website can reach your machine
http://127.0.0.1:9200/_search?callback=capture&source=…
Run in a virtual machine
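The URL on the slide works because of two query-string parameters: `callback` (JSONP) and `source` (the request body passed in the URL). If requests go through a proxy you control, a cheap extra check is to refuse any request carrying them. This is a minimal sketch of such a check and not a substitute for isolating the node; the function name, the blocked set, and the sample URLs are assumptions for illustration.

```python
from urllib.parse import parse_qs, urlsplit

# Assumption: these are the parameters we never want to reach Elasticsearch.
BLOCKED_PARAMS = {"callback", "source"}


def blocked_params(url):
    """Return the sorted list of blocked query-string parameters present in url."""
    query = parse_qs(urlsplit(url).query, keep_blank_values=True)
    return sorted(BLOCKED_PARAMS & set(query))


# Illustrative URLs only; the source value is a placeholder.
print(blocked_params("http://127.0.0.1:9200/_search?callback=capture&source=%7B%7D"))
print(blocked_params("http://127.0.0.1:9200/_search?q=title:wingsuit"))
```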
Networking
Elasticsearch is distributed
Easy (for a distributed system)
Supports many usage patterns.
Quite common topology
High availability, right?
Obey or risk split brains …
… and irrecoverable data loss
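Assuming the rule in question is the majority requirement (the summary puts it as "have a majority of nodes"), the arithmetic is short: a partition may only elect a master if it can see more than half of the master-eligible nodes. In clusters of this era that number was what went into `discovery.zen.minimum_master_nodes`. The function names below are hypothetical.

```python
def majority(master_eligible_nodes):
    """Smallest node count that is strictly more than half."""
    if master_eligible_nodes < 1:
        raise ValueError("need at least one master-eligible node")
    return master_eligible_nodes // 2 + 1


def can_elect_master(visible_nodes, master_eligible_nodes):
    """True if this side of a partition sees a majority."""
    return visible_nodes >= majority(master_eligible_nodes)


for total in (2, 3, 5):
    print(f"{total} master-eligible nodes: majority is {majority(total)}, "
          f"tolerates losing {total - majority(total)}")
```

Note what this implies for two nodes: the majority is 2, so the cluster cannot safely lose either node. Three is the smallest count that tolerates one failure.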
Stormy clouds
Zone vs instance failure
Thundering herds
Optimizing MTTR is not HA
Client considerations
Idempotent/retryable requests
Use a connection pool.
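Retrying is only safe when the request is idempotent, and retrying immediately from many clients at once is one way to produce a thundering herd. A common remedy, sketched here as an assumption and not as what any particular client library does, is exponential backoff with random jitter. `call_with_retries` and its parameters are hypothetical names; `sleep` and `rng` are injectable so the behaviour can be tested without waiting.

```python
import random
import time


def call_with_retries(operation, attempts=5, base_delay=0.1, max_delay=5.0,
                      retry_on=(ConnectionError, TimeoutError),
                      sleep=time.sleep, rng=random.random):
    """Call operation(), retrying on the given exceptions with jittered backoff.

    Only use this for idempotent operations: a retry may repeat work that
    already succeeded on the server.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return operation()
        except retry_on:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            ceiling = min(max_delay, base_delay * 2 ** attempt)
            # "Full jitter": wait a random time up to the current ceiling,
            # so clients that failed together do not retry together.
            sleep(rng() * ceiling)
```

A usage example would wrap a single request function, for instance `call_with_retries(lambda: fetch_document("42"))`, where `fetch_document` stands in for whatever your client exposes.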
_bulk / _msearch
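Batching through `_bulk` combines well with retryable requests if every document carries an explicit ID: re-sending the same batch then overwrites documents instead of duplicating them. The sketch below only builds the newline-delimited request body. The index name and documents are made up, `build_bulk_body` is a hypothetical helper, and older Elasticsearch versions also expect a `_type` in the action line, which is omitted here.

```python
import json


def build_bulk_body(index, docs):
    """Build a _bulk request body from (doc_id, source) pairs.

    Each document becomes two lines: an action line and the source line.
    The body must end with a newline.
    """
    lines = []
    for doc_id, source in docs:
        lines.append(json.dumps({"index": {"_index": index, "_id": doc_id}}))
        lines.append(json.dumps(source))
    if not lines:
        return ""
    return "\n".join(lines) + "\n"


body = build_bulk_body("talks", [
    ("1", {"title": "Elasticsearch in production"}),
    ("2", {"title": "Another talk"}),
])
print(body, end="")
```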
Have enough memory
Have a majority of nodes
Don't allow arbitrary search requests
Use retryable requests
Alex over Trondheim, Tore Helgedagsrud
Elephant, Roy Costello
Wingsuit, Richard Schneider
Lightning Storm and Stars, Justin Ennis
Wingsuit flock, Richard Schneider
Oh salad, you so funny, Eatliver