Elasticsearch in production

Post on 10-May-2015

DESCRIPTION

Video available at http://www.youtube.com/watch?v=gkdfNl0WL-A
Original slides at http://presentations.found.no/berlin-buzzwords-2013/

This talk covers some of the lessons we've learned from securing and herding hundreds of Elasticsearch clusters. It is applicable whether you operate Elasticsearch in your own infrastructure, in the cloud, or if you're a developer who wants a better understanding of Elasticsearch's various failure modes.

Elasticsearch easily lets you develop amazing things, and it has gone to great lengths to make Lucene's features readily available in a distributed setting. However, when it comes to running Elasticsearch in production, you still have a fairly complicated system on your hands: a system with high expectations on network stability, a huge appetite for memory, and a system that assumes all users are trustworthy.

Instead of delving deeply into a few specifics, we give a brief overview of problems you are likely to run into and suggested solutions to these problems. We cover topics that are applicable to both developers and users with Elasticsearch clusters of every shape and size – with an emphasis on resiliency and security. Basic familiarity with Elasticsearch is assumed.

TRANSCRIPT

How marketing thinks our users feel

How we developers sometimes feel

Who?

Co-founder of Found AS
7+ years of search, 2+ with Elasticsearch

We manage hundreds of Elasticsearch clusters

… on Amazon's cloud

Agenda

Memory (and stability)
Security (and multi-tenancy)
Networking (and reliability)
Client (and resiliency)

Memory

Search engines crave memory
Caches, caches, caches

Field- and filter caches
Page cache

Index building

PostgreSQL

Verifies resource usage
Safe >>> fast

Uses disk if necessary

Elasticsearch trusts you
Built for speed

It'll jump if you ask it to

What could possibly go wrong?

OutOfMemoryError

Woah there

I ate all the memories

Your cluster may or may not work any more

May or may not work?

What else was happening at the time?
Corrupt cluster state, crashed Netty, …

In short: Don't end up there

Warning signs?

Monitor cache sizes and heap space
Outgrowing page cache: gradual slowdown

Outgrowing heap space: sudden crash
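Since an exhausted heap shows up as a sudden crash rather than a gradual slowdown, it may help to alert well before the limit. The following is a minimal sketch that checks heap usage in a node-stats-like document. The field names (`heap_used_in_bytes`, `heap_max_in_bytes`) follow the JVM section of the node stats API as we understand it, so verify them against your version. The 70% threshold and the sample data are assumptions made for illustration.

```python
def heap_warnings(node_stats, threshold=0.7):
    """Return (node_name, heap_fraction) pairs for nodes at or above the threshold.

    `node_stats` is assumed to look like the JSON returned by the node stats
    API: {"nodes": {node_id: {"name": ..., "jvm": {"mem": {...}}}}}.
    The 0.7 default is an arbitrary illustrative value, not a recommendation.
    """
    warnings = []
    for node_id, node in node_stats["nodes"].items():
        mem = node["jvm"]["mem"]
        fraction = mem["heap_used_in_bytes"] / mem["heap_max_in_bytes"]
        if fraction >= threshold:
            warnings.append((node.get("name", node_id), round(fraction, 2)))
    return sorted(warnings)


# Hypothetical sample: node-1 uses 3 of 4 GiB, node-2 uses 1 of 4 GiB.
GIB = 2**30
sample = {
    "nodes": {
        "a1": {"name": "node-1", "jvm": {"mem": {
            "heap_used_in_bytes": 3 * GIB, "heap_max_in_bytes": 4 * GIB}}},
        "b2": {"name": "node-2", "jvm": {"mem": {
            "heap_used_in_bytes": 1 * GIB, "heap_max_in_bytes": 4 * GIB}}},
    }
}

print(heap_warnings(sample))  # [('node-1', 0.75)]
```

Heap used right after a garbage collection is usually a more telling signal than a single sample, so a real check would probably look at the trend over time.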

Understand the memory profile
Test realistically

Bound cache sizes and flush thresholds
v0.90+ takes you longer with field filters, etc.

Large heaps are expensive to garbage collect
Keep heap < 32GiB (But test!)

Lots of page cache is good, though!
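The trade-off between heap and page cache can be sketched as a small sizing helper. The 50% heap share and the 31 GiB cap below are assumptions chosen for illustration. The talk only says to stay under 32 GiB and to test, so treat the numbers as a starting point for your own measurements.

```python
GIB = 2**30


def suggest_heap_bytes(total_ram_bytes, heap_cap_bytes=31 * GIB, heap_share=0.5):
    """Suggest a JVM heap size, leaving the remaining RAM to the OS page cache.

    Both defaults are illustrative assumptions: half of RAM for the heap,
    capped just below 32 GiB.
    """
    return min(int(total_ram_bytes * heap_share), heap_cap_bytes)


for ram_gib in (8, 64, 128):
    heap_gib = suggest_heap_bytes(ram_gib * GIB) // GIB
    print(f"{ram_gib} GiB RAM: {heap_gib} GiB heap, "
          f"{ram_gib - heap_gib} GiB left for page cache")
```

On the larger machines the heap stops growing at the cap and all additional memory goes to the page cache.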

Security

Elasticsearch trusts everyone
Not its job to do auth(z)

You're the gatekeeper

_search

Read only?
Limit indexes / wrap with filters?
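One way to read "wrap with filters" is to have the gatekeeper build the search body itself instead of forwarding whatever the user sent. A minimal sketch follows. The field names `body` and `tenant_id`, the size cap, and the function name are all hypothetical. The `bool` query with a `filter` clause is the form used by later Elasticsearch versions; at the time of the talk the `filtered` query played this role.

```python
def build_search(user_text, tenant_id, requested_size=10, max_size=50):
    """Build a search body from trusted parts only.

    The user supplies plain text and a page size, never raw query DSL, so
    scripts, facets on arbitrary fields, and unfiltered queries cannot be
    smuggled in. Field names are hypothetical.
    """
    return {
        # Clamp the page size so a single request cannot ask for everything.
        "size": max(0, min(requested_size, max_size)),
        "query": {
            "bool": {
                "must": [{"match": {"body": user_text}}],
                # The tenant filter is always applied, whatever the user typed.
                "filter": [{"term": {"tenant_id": tenant_id}}],
            }
        },
    }


request = build_search("split brain", tenant_id="customer-42", requested_size=10000)
print(request["size"])
print(request["query"]["bool"]["filter"])
```

Restricting which fields can be sorted or aggregated on in the same way also helps protect the field caches.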

Protect the field caches

Arbitrary code execution

Elasticsearch has powerful scripting
Not sandboxed
On by default

Any website can reach your machine
http://127.0.0.1:9200/_search?callback=capture&source=…

Run in a virtual machine

Networking

Elasticsearch is distributed
Easy (for a distributed system)

Supports many usage patterns.

Quite common topology
High availability, right?

Obey or risk split brains …
… and irrecoverable data-loss
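The rule being obeyed here is a majority quorum: a master may only be elected when more than half of the master-eligible nodes can see each other. The arithmetic is sketched below. The function names are ours, and the value corresponds to what Elasticsearch versions of that era configured via `discovery.zen.minimum_master_nodes`.

```python
def minimum_master_nodes(master_eligible):
    """Smallest group size that is a strict majority of master-eligible nodes."""
    if master_eligible < 1:
        raise ValueError("need at least one master-eligible node")
    return master_eligible // 2 + 1


def can_elect_master(visible, master_eligible):
    """True if a partition that sees `visible` nodes is allowed to elect a master."""
    return visible >= minimum_master_nodes(master_eligible)


for nodes in (2, 3, 5):
    quorum = minimum_master_nodes(nodes)
    print(f"{nodes} nodes: quorum {quorum}, tolerates {nodes - quorum} lost node(s)")
```

With two nodes the quorum is two, so the cluster tolerates no node loss at all. Setting the quorum to one instead lets both halves of a partition elect their own master, which is the split brain. Three nodes is the smallest cluster that survives losing one.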

Stormy clouds

Zone vs instance failure
Thundering herds

Optimizing MTTR is not HA

Client considerations

Idempotent/retryable requests
Use a connection pool
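Retrying is only safe when repeating a request gives the same result, for example indexing a document under an explicit ID rather than letting the server generate one. Below is a minimal retry sketch with exponential backoff and random jitter, so that many clients do not all retry at the same instant. The helper name, the defaults, and the choice of `ConnectionError` as the retryable failure are assumptions. A real client would map its own transport errors.

```python
import random
import time


def retry(operation, attempts=4, base_delay=0.1, max_delay=5.0,
          sleep=time.sleep, rng=random.random):
    """Call operation() until it succeeds or the attempts are used up.

    Only use this for idempotent operations. The delay doubles per attempt,
    is capped at max_delay, and is scaled by a random factor in [0, 1)
    (jitter) so that clients spread out their retries.
    """
    for attempt in range(attempts):
        try:
            return operation()
        except ConnectionError:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the failure to the caller
            delay = min(max_delay, base_delay * 2 ** attempt)
            sleep(delay * rng())


# Hypothetical operation that fails twice before succeeding.
calls = {"count": 0}


def flaky_index():
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("node unavailable")
    return "indexed"


delays = []  # record the delays instead of actually sleeping
result = retry(flaky_index, sleep=delays.append, rng=lambda: 1.0)
print(result, calls["count"], len(delays))
```

The `sleep` and `rng` parameters exist only so the behaviour can be exercised without real waiting.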

_bulk / _msearch
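A `_bulk` request body is newline-delimited JSON: one action line, then one source line per document, with a trailing newline at the end. The sketch below builds such a body with explicit document IDs, which keeps the whole batch safe to resend. The index name and documents are hypothetical, and older Elasticsearch versions also expect a `_type` in the action line or URL, which is omitted here.

```python
import json


def bulk_body(index, docs):
    """Build a _bulk request body from a mapping of document ID to source.

    Explicit IDs mean that resending the same body overwrites the same
    documents instead of creating duplicates.
    """
    lines = []
    for doc_id, source in docs.items():
        lines.append(json.dumps({"index": {"_index": index, "_id": doc_id}}))
        lines.append(json.dumps(source))
    # The body must end with a newline.
    return "\n".join(lines) + "\n"


body = bulk_body("talks", {
    "bbuzz-2013": {"title": "Elasticsearch in production"},
    "example-2": {"title": "Another talk"},
})
print(body, end="")
```

Batching like this cuts down on round trips, but each batch still has to fit comfortably in memory on both the client and the node that receives it.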

Have enough memory
Have a majority of nodes
Don't allow arbitrary search requests
Use retryable requests
