elasticsearch in production

Elasticsearch in production

Alex Brasetvik

@alexbrasetvik

http://www.found.no/



https://twitter.com/alexbrasetvik


How marketing thinks our users feel

How we developers sometimes feel

Who?

Co-founder of Found AS

7+ years of search, 2+ Elasticsearch

We manage hundreds of Elasticsearch clusters

… on Amazon's cloud

Agenda

Memory (and stability)

Security (and multi-tenancy)

Networking (and reliability)

Client (and resiliency)




Memory

Search engines crave memory

Caches, caches, caches

Field- and filter caches

Page cache

Index building




PostgreSQL

Verifies resource usage

Safe >>> fast

Uses disk if necessary

Elasticsearch trusts you

Built for speed

It'll jump if you ask it to

What could possibly go wrong?

OutOfMemoryError

Woah there

I ate all the memories

Your cluster may or may not work any more

May or may not work?

What else was happening at the time?

Corrupt cluster state, crashed Netty, …

In short: Don't end up there




Warning signs?

Monitor cache sizes and heap space

Outgrowing page cache: gradual slowdown

Outgrowing heap space: sudden crash
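
A minimal sketch of the heap half of that monitoring, assuming a node-stats response shaped like the `sample` dictionary below (the `heap_used_in_bytes` and `heap_max_in_bytes` field names match the JVM section of the nodes stats API in later releases; older versions may differ). The 75% threshold and the function name are illustrative choices, not recommendations from the talk.

```python
def heap_warnings(node_stats, threshold=0.75):
    """Return (node name, heap fraction) pairs at or above the threshold."""
    warnings = []
    for node_id, node in node_stats.get("nodes", {}).items():
        mem = node["jvm"]["mem"]
        fraction = mem["heap_used_in_bytes"] / mem["heap_max_in_bytes"]
        if fraction >= threshold:
            warnings.append((node.get("name", node_id), round(fraction, 2)))
    return warnings


GIB = 2 ** 30

# Hypothetical stats for a two-node cluster with 4 GiB heaps.
sample = {
    "nodes": {
        "a1": {"name": "node-1", "jvm": {"mem": {
            "heap_used_in_bytes": 3 * GIB, "heap_max_in_bytes": 4 * GIB}}},
        "b2": {"name": "node-2", "jvm": {"mem": {
            "heap_used_in_bytes": 1 * GIB, "heap_max_in_bytes": 4 * GIB}}},
    }
}

print(heap_warnings(sample))  # [('node-1', 0.75)]
```

Because outgrowing the heap shows up as a sudden crash rather than a slowdown, the point of a check like this is to alert well before the limit is reached.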




Understand the memory profile

Test realistically

Bound cache sizes and flush thresholds

v0.90+ takes you further with field filters, etc.




Large heaps are expensive to garbage collect

Keep heap < 32GiB (But test!)

Lots of page cache is good, though!
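
The sizing advice above can be sketched as a small helper. The 32 GiB ceiling comes from the slide; the default of giving half the machine's RAM to the heap (leaving the rest for page cache) is a common rule of thumb assumed here, not a figure from the talk, and `suggest_heap_gib` is a hypothetical name.

```python
def suggest_heap_gib(total_ram_gib, cap_gib=31.0, heap_share=0.5):
    """Suggest a JVM heap size in GiB.

    heap_share is an assumed rule of thumb; cap_gib keeps the heap
    below the 32 GiB mark mentioned on the slide.
    """
    if total_ram_gib <= 0:
        raise ValueError("total_ram_gib must be positive")
    return min(total_ram_gib * heap_share, cap_gib)


print(suggest_heap_gib(16))   # 8.0
print(suggest_heap_gib(128))  # 31.0
```

Whatever is not given to the heap stays available to the operating system as page cache. As the slide says, treat the result as a starting point and test.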




Security

Elasticsearch trusts everyone

Not its job to do auth(z)

You're the gatekeeper




_search

Read only?

Limit indexes / wrap with filters?

Protect the field caches
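
One way to act as the gatekeeper is to never forward a client's search body as-is and instead build the request on the server. The sketch below assumes a hypothetical allow-list of indexes, a `title` field and a `tenant_id` field; none of these names come from the talk. It uses the `bool` query with a `filter` clause from later Elasticsearch releases; releases of the talk's era expressed the same idea with a `filtered` query.

```python
ALLOWED_INDEXES = {"products", "articles"}  # hypothetical allow-list


def build_search(index, text, tenant_id):
    """Build a search the application controls end to end.

    The user supplies only plain text; the index is checked against the
    allow-list and the tenant filter is always applied.
    """
    if index not in ALLOWED_INDEXES:
        raise PermissionError("index not allowed: %r" % index)
    body = {
        "query": {
            "bool": {
                "must": [{"match": {"title": text}}],
                "filter": [{"term": {"tenant_id": tenant_id}}],
            }
        },
        "size": 20,
    }
    return "/%s/_search" % index, body


path, body = build_search("products", "red shoes", "t-42")
print(path)
```

Because the application decides which fields are queried, sorted or aggregated on, a user cannot force arbitrary fields into the field caches.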




Arbitrary code execution

Elasticsearch has powerful scripting

Not sandboxed

On by default
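
If request bodies from less trusted callers must be accepted at all, a gatekeeper can at least refuse anything that mentions scripting. This is a rough sketch with a hypothetical helper name; it treats any key containing "script" as suspicious, which is an assumption about how scripts appear in request bodies. A deny-list like this is easy to get wrong, so it complements rather than replaces building requests on the server.

```python
def contains_script(node):
    """Return True if any dictionary key in the request mentions 'script'."""
    if isinstance(node, dict):
        for key, value in node.items():
            if "script" in key.lower() or contains_script(value):
                return True
        return False
    if isinstance(node, list):
        return any(contains_script(item) for item in node)
    return False


harmless = {"query": {"match_all": {}}}
suspicious = {
    "query": {"match_all": {}},
    "script_fields": {"x": {"script": "..."}},  # script body elided
}

print(contains_script(harmless))    # False
print(contains_script(suspicious))  # True
```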




Any website can reach your machine

http://127.0.0.1:9200/_search?callback=capture&source=…

Run in a virtual machine





Networking

Elasticsearch is distributed

Easy (for a distributed system)

Supports many usage patterns.




Quite common topology

High availability, right?




Obey or risk split brains …

… and irrecoverable data loss




+1 is a "tie breaker"
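
Assuming the rule in question is the usual majority quorum (more than half of the master-eligible nodes must agree), the arithmetic looks like this. In releases of this era the result is what one would put in the `discovery.zen.minimum_master_nodes` setting; the function name is illustrative.

```python
def majority(master_eligible_nodes):
    """Smallest number of nodes that is more than half of the total."""
    if master_eligible_nodes < 1:
        raise ValueError("need at least one master-eligible node")
    return master_eligible_nodes // 2 + 1


for n in range(1, 6):
    print(n, "nodes -> majority of", majority(n))
# 1 nodes -> majority of 1
# 2 nodes -> majority of 2
# 3 nodes -> majority of 2
# 4 nodes -> majority of 3
# 5 nodes -> majority of 3
```

Note that with two nodes the majority is both nodes, so a two-node cluster cannot lose either node and still have a majority. Three nodes is the smallest size that tolerates one failure.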




Stormy clouds

Zone vs instance failure

Thundering herds

Optimizing MTTR is not HA
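
One common way to soften a thundering herd on the client side is to retry with exponential backoff and random jitter, so that clients do not all reconnect at the same instant. The sketch below is a generic illustration, not something shown in the talk; the names, delays and the choice of `ConnectionError` as the retryable failure are assumptions, and it is only appropriate for requests that are safe to repeat.

```python
import random
import time


def backoff_delays(attempts, base=0.5, cap=30.0, rng=random.random):
    """Full-jitter delays: a random fraction of an exponentially growing bound."""
    return [rng() * min(cap, base * 2 ** i) for i in range(attempts)]


def call_with_retries(request, attempts=5, sleep=time.sleep):
    """Call request() up to `attempts` times, sleeping between failures."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    delays = backoff_delays(attempts - 1)
    for attempt in range(attempts):
        try:
            return request()
        except ConnectionError:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(delays[attempt])


# Upper bounds of the delays when the random factor is pinned to 1.0:
print(backoff_delays(4, rng=lambda: 1.0))  # [0.5, 1.0, 2.0, 4.0]
```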

Client considerations

Idempotent/retryable requests

Use a connection pool.

_bulk / _msearch
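
A sketch of how a `_bulk` request body can be built so that re-sending it is harmless: each document gets an explicit `_id`, so a retried `index` action overwrites the same document instead of creating a duplicate. The index name and documents are made up for illustration, and releases of this era also expect a `_type` in the action line, which is omitted here.

```python
import json


def bulk_body(index, docs):
    """Build a newline-delimited _bulk body from (doc_id, source) pairs."""
    lines = []
    for doc_id, source in docs:
        # Action line first, then the document itself on the next line.
        lines.append(json.dumps({"index": {"_index": index, "_id": doc_id}}))
        lines.append(json.dumps(source))
    # The bulk format requires a trailing newline after the last line.
    return "\n".join(lines) + "\n"


body = bulk_body("logs", [("1", {"msg": "a"}), ("2", {"msg": "b"})])
print(body, end="")
```

Batching many documents into one request also reduces the number of round trips, which is the reason for preferring `_bulk` and `_msearch` over many single requests.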




Have enough memory

Have a majority of nodes

Don't allow arbitrary search requests

Use retryable requests

Alex over Trondheim, Tore Helgedagsrud

Elephant, Roy Costello

Wingsuit, Richard Schneider

Lightning Storm and Stars, Justin Ennis

Wingsuit flock, Richard Schneider

Oh salad, you so funny, Eatliver




http://www.flickr.com/photos/roycostello/4458744326/


http://www.flickr.com/photos/picturecorrect/7623542822/


http://www.flickr.com/photos/averain/7575347526/




http://www.eatliver.com/i.php?n=6691

