Elasticsearch at Automattic


Upload: greg-brown

Post on 27-Jan-2015

Category: Technology


DESCRIPTION

Presentation from the Elasticsearch Denver Meetup. Discusses scaling of Elasticsearch for Related Posts across WordPress.com and some of the big changes that were needed in order to scale for 23 million queries a day across 800 million documents.

TRANSCRIPT

Page 1: Elasticsearch at Automattic

Tuesday, February 25, 14

Page 2: Elasticsearch at Automattic

Greg Ichneumon Brown

http://gibrown.wordpress.com
@gregibrown
[email protected]

Data Wrangler at Automattic

Page 3: Elasticsearch at Automattic

Page 4: Elasticsearch at Automattic

1 Billion Monthly Uniques

Page 5: Elasticsearch at Automattic

Elasticsearch Deployments

Internal Search - 216 Internal Blogs - 750k docs [3 GB]
Support Documents - KNN Link Prediction - 1.7m docs [14 GB]
Polldaddy - Word Clouds/Freq Response - 39m docs [9 GB]

WordPress.com VIP Search
- KFF.org - 18k docs [99 MB]
- NY Post - 600k docs [2.3 GB]

WordPress.com - ~800m docs [4 TB]
- Related Posts - 23 mil reqs/day
- search.wordpress.com - 3 mil reqs/day

Page 6: Elasticsearch at Automattic

Overview of Related Posts

Our “10X Improvements”
- Indexing
- Querying

Our Open Issues

Page 7: Elasticsearch at Automattic

Related Posts

Search within just the one blog

Page 8: Elasticsearch at Automattic

WordPress.com Total Elasticsearch Operations

Operation          Ops/Day
Routed Queries     23 mil
Global Queries     2 mil
Docs Indexed       13 mil
Docs Updated       10 mil
Docs Deleted       2.5 mil
Delete By Query    250k
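
To put the daily totals above in per-second terms, here is a small arithmetic sketch. It assumes the load is spread evenly over the day, which real traffic is not, so the results are averages rather than peaks. The dictionary keys are illustrative names, not identifiers from the talk.

```python
# Daily operation counts copied from the slide (operations per day).
OPS_PER_DAY = {
    "routed_queries": 23_000_000,
    "global_queries": 2_000_000,
    "docs_indexed": 13_000_000,
    "docs_updated": 10_000_000,
    "docs_deleted": 2_500_000,
    "delete_by_query": 250_000,
}

SECONDS_PER_DAY = 24 * 60 * 60


def average_per_second(ops_per_day: int) -> float:
    """Average rate, assuming the load is spread evenly over 24 hours."""
    return ops_per_day / SECONDS_PER_DAY


if __name__ == "__main__":
    for name, count in OPS_PER_DAY.items():
        print(f"{name}: {average_per_second(count):.1f} ops/sec")
```

On these assumptions, 23 mil routed queries a day averages out to roughly 266 queries per second.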

Page 9: Elasticsearch at Automattic

Global Cluster

Data center    Data nodes    Master nodes
DC1            14            1
DC2            14            1
DC3            14            1

Page 10: Elasticsearch at Automattic

Our Secret To Scaling

Routed Queries

All Posts for each Blog are on the same Shard
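
A minimal sketch of why routing keeps a blog on one shard. Elasticsearch accepts a `routing` value on index and search requests and derives the shard from a hash of that value. The hash below (`zlib.crc32`) is only a stand-in and is NOT the function Elasticsearch uses, and the document ID format and the `index_request` helper are hypothetical.

```python
import zlib


def shard_for(routing_key: str, num_shards: int) -> int:
    """Stand-in for the internal routing hash: same key, same shard."""
    return zlib.crc32(routing_key.encode("utf-8")) % num_shards


def index_request(blog_id: int, post_id: int, doc: dict) -> dict:
    """Describe an index operation that uses the blog ID as the routing value.

    The "<blog_id>-<post_id>" ID format is an assumption for illustration.
    """
    return {"id": f"{blog_id}-{post_id}", "routing": str(blog_id), "body": doc}


if __name__ == "__main__":
    first = index_request(12345, 1, {"title": "First post"})
    second = index_request(12345, 2, {"title": "Second post"})
    # Both posts share a routing value, so they map to the same shard.
    print(shard_for(first["routing"], 25), shard_for(second["routing"], 25))
```

A query that passes the same routing value only has to visit that one shard, which is what makes a per-blog Related Posts query cheap.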

Page 11: Elasticsearch at Automattic

Global Index

7 Indices
10 mil Blogs per Index
25 Shards per Index

175 Shards Total
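
The layout arithmetic can be checked with a short sketch. It assumes blogs are assigned to indices by contiguous blog-ID range, which the slide does not state, and the `global-N` index names are hypothetical.

```python
BLOGS_PER_INDEX = 10_000_000
NUM_INDICES = 7
SHARDS_PER_INDEX = 25

# 7 indices x 25 shards each gives the 175 shards quoted on the slide.
TOTAL_SHARDS = NUM_INDICES * SHARDS_PER_INDEX


def index_for_blog(blog_id: int) -> str:
    """Pick an index by blog-ID range (assumed scheme, hypothetical names)."""
    index_number = blog_id // BLOGS_PER_INDEX
    if not 0 <= index_number < NUM_INDICES:
        raise ValueError(f"blog_id {blog_id} is outside the configured indices")
    return f"global-{index_number}"


if __name__ == "__main__":
    print(TOTAL_SHARDS)
    print(index_for_blog(12_345_678))
```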

Page 12: Elasticsearch at Automattic

Overview of Related Posts

Our “10X Improvements”
- Indexing
- Querying

Our Open Issues

Page 13: Elasticsearch at Automattic

20% Improvements Don’t solve scaling problems

Page 14: Elasticsearch at Automattic

Entangling Elasticsearch with Existing Systems

Indexing

Page 15: Elasticsearch at Automattic

Bulk Indexing 1.0

44 Days to Index all Posts (estimated)

Page 16: Elasticsearch at Automattic

Bulk Indexing Problems

- Overhead: Spent too much time starting indexing jobs

WordPress.com has 500 mil MySQL tables.

- High DB Load: Corner Cases. Blogs with 1+ mil followers.
- High DB Load: Indexing sequentially doesn’t spread the load.
- High DB Load: Heavy load on archive DBs.

Page 17: Elasticsearch at Automattic

Bulk Indexing Today

12.0?

4 Days to Index all Posts (running right now)

Page 18: Elasticsearch at Automattic

Real Time Indexing

The Hardest Part!

Page 19: Elasticsearch at Automattic

Real Time Goals

1) Eventually Consistent

2) Minimize Bulk Re-indexing

3) Normally updated < 1 minute

Page 20: Elasticsearch at Automattic

Real Time Goals

1) Eventually Consistent

2) Minimize Bulk Re-indexing

3) Normally updated < 1 minute

Bulk reindexed 3 times in 5 months.

One intentional, two during system upgrades.

Page 21: Elasticsearch at Automattic

Stuff Fails

1) Humans

2) Hardware

3) Elasticsearch (steady improvements)

Combinations of the above.

Page 22: Elasticsearch at Automattic

Hardware Problems

1) Detect and Track Down Servers

2) Prioritize Queries over Indexing

3) Throttle Indexing Jobs

- any issues: block bulk changes to blogs

- >10 min: block doc updates

- >20 min: block all indexing
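
The escalating throttle above can be sketched as a small decision function. The slide does not say what the minute thresholds measure; this sketch assumes they are how long the cluster problem has been going on. The operation names are hypothetical labels.

```python
from typing import FrozenSet, Optional

# Hypothetical labels for the kinds of indexing work being throttled.
ALL_INDEXING_OPS = frozenset(
    {"bulk_blog_change", "doc_update", "doc_index", "doc_delete"}
)


def blocked_ops(issue_minutes: Optional[float]) -> FrozenSet[str]:
    """Return the operations to block, given minutes since a problem began.

    None means the cluster is healthy and nothing is blocked.
    """
    if issue_minutes is None:
        return frozenset()
    if issue_minutes > 20:
        return ALL_INDEXING_OPS  # >20 min: block all indexing
    if issue_minutes > 10:
        return frozenset({"bulk_blog_change", "doc_update"})  # >10 min
    return frozenset({"bulk_blog_change"})  # any issue at all


if __name__ == "__main__":
    for minutes in (None, 1, 15, 30):
        print(minutes, sorted(blocked_ops(minutes)))
```

Queries are never in the blocked set, which matches the stated priority of queries over indexing.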

Page 23: Elasticsearch at Automattic

Real Time Failures

1) Auto Retry Failed Indexing Jobs

2) Indexing Queue for Failures

3) Scrolling Queries to Find Bad Docs
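
The first two items can be sketched as a retry loop that hands persistent failures to a queue for later reprocessing. This is an illustration under assumptions, not Automattic's job system: `send` is a hypothetical callable that raises on failure, and the retry count is arbitrary.

```python
from collections import deque
from typing import Callable, Deque


def index_with_retry(
    job: dict,
    send: Callable[[dict], None],
    failed_queue: Deque[dict],
    max_attempts: int = 3,
) -> bool:
    """Try an indexing job up to max_attempts times.

    Returns True on success. After the last failed attempt the job is
    appended to failed_queue so it can be replayed later.
    """
    for _ in range(max_attempts):
        try:
            send(job)
            return True
        except Exception:
            # Broad catch for the sketch; real code would catch specific
            # errors and usually back off between attempts.
            continue
    failed_queue.append(job)
    return False


if __name__ == "__main__":
    def always_down(job: dict) -> None:
        raise ConnectionError("cluster unavailable")

    queue: Deque[dict] = deque()
    print(index_with_retry({"id": "1-1"}, always_down, queue), list(queue))
```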

Page 24: Elasticsearch at Automattic

Cluster Restarts

Indexing across replicas is non-deterministic

Segments diverge

Slows Restart Time

Page 25: Elasticsearch at Automattic

Simplistic Example

Segments w/ identical checksums

Docs

Primary

Replica

Shard 1 merges

Only first segment is identical
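
A toy model of the example above. Segments are modelled as tuples of doc IDs and the checksum is a hash of those IDs; real segment checksums cover the segment files on disk, so this only illustrates the idea that a merge on one copy leaves fewer segments in common with the other.

```python
import hashlib
from typing import List, Tuple

Segment = Tuple[int, ...]


def checksum(segment: Segment) -> str:
    """Toy checksum over a segment's doc IDs."""
    return hashlib.sha1(",".join(map(str, segment)).encode("ascii")).hexdigest()


def merge_adjacent(segments: List[Segment], i: int) -> List[Segment]:
    """Return a new list with segments i and i+1 merged into one segment."""
    return segments[:i] + [segments[i] + segments[i + 1]] + segments[i + 2:]


def identical_segments(primary: List[Segment], replica: List[Segment]) -> List[Segment]:
    """Primary segments whose checksum also appears on the replica."""
    replica_sums = {checksum(s) for s in replica}
    return [s for s in primary if checksum(s) in replica_sums]


if __name__ == "__main__":
    replica = [(1, 2), (3, 4), (5, 6)]
    # The primary holds the same docs but has merged its last two segments.
    primary = merge_adjacent(list(replica), 1)
    print(identical_segments(primary, replica))  # [(1, 2)]
```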

Page 26: Elasticsearch at Automattic

After Bulk Index

Every segment is out of sync!

Page 27: Elasticsearch at Automattic

Our Bulk Indexing Procedure

1) Bulk Index All Docs

2) Optimize the index

3) Rolling Restart (sync segments)

4) Future restarts will be much faster.

- Play with recovery settings

- SSDs? => use the noop Linux I/O scheduler

Page 28: Elasticsearch at Automattic

Indexing

It’s all about handling Failures

Page 29: Elasticsearch at Automattic

Overview of Related Posts

Our “10X Improvements”
- Indexing
- Querying

Our Open Issues

Page 30: Elasticsearch at Automattic

Querying

Test and Iterate

Page 31: Elasticsearch at Automattic

Related Posts Query

Started with MoreLikeThis API.

Did not scale well enough.

Page 32: Elasticsearch at Automattic

MLT API

1) Get Document

2) Analyze Document

3) Search for Similar Docs

Page 33: Elasticsearch at Automattic

MLT API vs MLT Query

MLT API                   MLT Query
147 req/sec               1062 req/sec
40% CPU                   30% CPU
306 ms median latency     49.5 ms median latency
All processing by ES      Build query in PHP

Page 34: Elasticsearch at Automattic

Related Posts Relevancy

Great With Long Content

{
  "more_like_this": {
    "fields": ["mlt_content"],
    "like_text": "Scaling Elasticsearch Part 1: Overview ElasticSearch scaling Search We recently launched Related Posts across WordPress.com, so its time to pop the hood and take a look at what ended up in our engine... ",
    "percent_terms_to_match": 0.08,
    "boost_terms": 5,
    "analyzer": "en_analyzer"
  }
}
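
Building this query on the client side, rather than letting the MLT API fetch and analyze the document, might look like the sketch below. The talk says the real query is built in PHP; Python is used here only for illustration. Parameter names and values are copied from the slide's query, and the `build_mlt_query` helper is hypothetical.

```python
import json


def build_mlt_query(text: str, analyzer: str = "en_analyzer") -> dict:
    """Build a more_like_this query body from text the caller already has.

    Field names and tuning values mirror the query shown on the slide.
    """
    return {
        "more_like_this": {
            "fields": ["mlt_content"],
            "like_text": text,
            "percent_terms_to_match": 0.08,
            "boost_terms": 5,
            "analyzer": analyzer,
        }
    }


if __name__ == "__main__":
    post_text = "Scaling Elasticsearch Part 1: Overview"
    print(json.dumps(build_mlt_query(post_text), indent=2))
```

Because the caller supplies the text, the "Get Document" step of the MLT API is skipped on the Elasticsearch side.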

Page 35: Elasticsearch at Automattic

MLT Query Relevancy

Use match or multi_match for short content.

Average Related Posts CTR
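
One way to act on that advice is to pick the query type from the length of the post. The word-count threshold below is an arbitrary assumption, not a figure from the talk, and the `title` and `content` field names for `multi_match` are taken from the Post Doc fields shown later in the deck.

```python
def related_posts_query(title: str, content: str, min_words_for_mlt: int = 50) -> dict:
    """Use more_like_this for long posts and multi_match for short ones.

    min_words_for_mlt is a hypothetical cut-off chosen for illustration.
    """
    text = f"{title} {content}".strip()
    if len(text.split()) >= min_words_for_mlt:
        return {"more_like_this": {"fields": ["mlt_content"], "like_text": text}}
    return {"multi_match": {"query": text, "fields": ["title", "content"]}}


if __name__ == "__main__":
    short = related_posts_query("Hello", "A two line post.")
    long_post = related_posts_query("Scaling", "word " * 200)
    print(list(short), list(long_post))
```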

Page 36: Elasticsearch at Automattic

Language Analyzers

arabic, armenian, basque, brazilian, bulgarian, catalan, chinese, czech, danish, dutch, english, finnish, french, galician, german, greek, hindi, hungarian, indonesian, italian, japanese, korean, norwegian, persian, portuguese, romanian, russian, spanish, swedish, turkish, thai
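
Choosing an analyzer per post could be sketched as a lookup from language code to analyzer name. The `<code>_analyzer` naming pattern is inferred from the single `en_analyzer` name that appears in the deck's query and is an assumption, as is falling back to Elasticsearch's built-in `standard` analyzer for unlisted languages.

```python
# ISO 639-1 style codes for the languages listed on the slide.
LANGUAGES = {
    "ar": "arabic", "hy": "armenian", "eu": "basque", "pt-br": "brazilian",
    "bg": "bulgarian", "ca": "catalan", "zh": "chinese", "cs": "czech",
    "da": "danish", "nl": "dutch", "en": "english", "fi": "finnish",
    "fr": "french", "gl": "galician", "de": "german", "el": "greek",
    "hi": "hindi", "hu": "hungarian", "id": "indonesian", "it": "italian",
    "ja": "japanese", "ko": "korean", "no": "norwegian", "fa": "persian",
    "pt": "portuguese", "ro": "romanian", "ru": "russian", "es": "spanish",
    "sv": "swedish", "tr": "turkish", "th": "thai",
}

FALLBACK_ANALYZER = "standard"  # assumed fallback, not stated in the talk


def analyzer_for(lang_code: str) -> str:
    """Map a language code to an analyzer name (assumed naming pattern)."""
    code = lang_code.strip().lower().replace("_", "-")
    if code in LANGUAGES:
        return f"{code}_analyzer"
    return FALLBACK_ANALYZER


if __name__ == "__main__":
    print(analyzer_for("en"), analyzer_for("pt_BR"), analyzer_for("xx"))
```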

Page 37: Elasticsearch at Automattic

Related Posts Relevancy

How Important is using the correct Language Analyzer?

Page 38: Elasticsearch at Automattic

Related Posts Relevancy

How Important is using the correct Language Analyzer?

Doubled Click Through Rate

Page 39: Elasticsearch at Automattic

Unfortunately

Increased Slow Queries (>1 second) by 10x

Still worth it.

Page 40: Elasticsearch at Automattic

Global Query Performance

search.wordpress.com

Page 41: Elasticsearch at Automattic

Parent-Child Filtering

Blog Doc (parent)
- public: true|false

Post Doc (child)
- title: “...”
- content: “...”

Page 42: Elasticsearch at Automattic

has_parent Filter

With has_parent           Without has_parent
7.6 req/sec               17.5 req/sec
75% CPU                   50% CPU
503 ms median latency     207 ms median latency

Requires more Indexing

Querying Across All Shards
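
The two query shapes being compared might look like the sketch below. It uses the `filtered` query form from Elasticsearch 1.x, which is an assumption about the version in use. The `blog` parent type name and the `blog_public` field are hypothetical; only the `public` flag comes from the slides. Copying the flag onto every post is one assumed way to drop `has_parent`, not a method the talk confirms.

```python
def with_has_parent(text_query: dict) -> dict:
    """Filter posts by a flag stored on the parent blog document."""
    return {
        "filtered": {
            "query": text_query,
            "filter": {
                "has_parent": {
                    "parent_type": "blog",  # hypothetical type name
                    "query": {"term": {"public": True}},
                }
            },
        }
    }


def without_has_parent(text_query: dict) -> dict:
    """Filter on a flag copied onto each post (hypothetical field name).

    Every post must be reindexed whenever the blog's visibility changes.
    """
    return {
        "filtered": {
            "query": text_query,
            "filter": {"term": {"blog_public": True}},
        }
    }


if __name__ == "__main__":
    query = {"match": {"content": "elasticsearch scaling"}}
    print(with_has_parent(query))
    print(without_has_parent(query))
```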

Page 43: Elasticsearch at Automattic

Indexing:

Optimize to Handle Failures

Querying:

Test and Iterate

Page 44: Elasticsearch at Automattic

Overview of Related Posts

Our “10X Improvements”
- Indexing
- Querying

Our Open Issues

Page 45: Elasticsearch at Automattic

Open Issues

Slow Queries (> 1 second)

Getting Better. Shards are too big.

Page 46: Elasticsearch at Automattic

Open Issues

What does it take to scale?

3x Data

5x Queries

Page 47: Elasticsearch at Automattic

Open Issues

Elasticsearch for Natural Language Processing?

At Scale.

On Live Data.

Page 48: Elasticsearch at Automattic

http://gibrown.wordpress.com
@gregibrown

Feeling Inspired?
http://automattic.com/work-with-us/data-wrangler/
