Elasticsearch at Automattic

Post on 27-Jan-2015

DESCRIPTION

Presentation from the Elasticsearch Denver Meetup. Discusses scaling of Elasticsearch for Related Posts across WordPress.com and some of the big changes that were needed in order to scale for 23 million queries a day across 800 million documents.

TRANSCRIPT

Elasticsearch at Automattic

Tuesday, February 25, 14

Greg Ichneumon Brown

http://gibrown.wordpress.com
@gregibrown
greg@automattic.com

Data Wrangler at Automattic

1 Billion Monthly Uniques

Elasticsearch Deployments

Internal Search - 216 Internal Blogs - 750k docs [3 GB]
Support Documents - KNN Link Prediction - 1.7m docs [14 GB]
Polldaddy - Word Clouds/Freq Response - 39m docs [9 GB]

WordPress.com VIP Search
- KFF.org - 18m docs [99 MB]
- NY Post - 600k docs [2.3 GB]

WordPress.com - ~800m docs [4 TB]
- Related Posts - 48 mil reqs/day
- search.wordpress.com - 3 mil reqs/day

Overview of Related Posts

Our “10X Improvements”
- Indexing
- Querying

Our Open Issues

Related Posts

Search within just the one blog

WordPress.com Total Elasticsearch Operations

Operation         Ops/Day
Routed Queries    23 mil
Global Queries    2 mil
Docs Indexed      13 mil
Docs Updated      10 mil
Docs Deleted      2.5 mil
Delete By Query   250k

Global Cluster

- DC1: 14 Data, 1 Master
- DC2: 14 Data, 1 Master
- DC3: 14 Data, 1 Master

Our Secret To Scaling

Routed Queries

All Posts for each Blog are on the same Shard
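
Routing is what keeps a blog's posts together: Elasticsearch chooses the shard from a hash of the routing value, so indexing and searching with the same value touch a single shard. A minimal sketch of the idea, assuming the blog ID is the routing value. The CRC32 hash and the helper names are illustrative stand-ins, not Elasticsearch's internal hash function; `routing` itself is the real request parameter.

```python
import zlib

NUM_SHARDS = 25  # shards per index, as stated on the Global Index slide


def shard_for(routing_value: str, num_shards: int = NUM_SHARDS) -> int:
    """Illustrative shard choice: hash(routing value) mod number of shards."""
    return zlib.crc32(routing_value.encode("utf-8")) % num_shards


def routed_search_params(blog_id: int) -> dict:
    """Query-string parameters for a search limited to one blog's shard."""
    return {"routing": str(blog_id)}


# Every post of blog 1234 lands on the same shard, whatever its post ID,
# because only the blog ID feeds the hash.
posts = [(1234, 1), (1234, 2), (1234, 99)]
shards = {shard_for(str(blog_id)) for blog_id, _post_id in posts}
```

Because the post ID never enters the hash, a Related Posts query for one blog only needs to ask the one shard that holds that blog.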

Global Index

7 Indices
10 mil Blogs per Index
25 Shards per Index

175 Shards Total
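
The arithmetic behind these numbers can be written out as below. The fan-out comparison rests on an assumption not stated on the slide: a routed query touches one shard, while an unrouted global query is sent to every shard (replicas are ignored here).

```python
INDICES = 7
SHARDS_PER_INDEX = 25
BLOGS_PER_INDEX = 10_000_000

total_shards = INDICES * SHARDS_PER_INDEX   # 7 * 25
blog_capacity = INDICES * BLOGS_PER_INDEX   # blogs the 7 indices are sized for

# Assumption: a routed query hits one shard; a global query hits all of them.
shards_per_routed_query = 1
shards_per_global_query = total_shards
fan_out_ratio = shards_per_global_query // shards_per_routed_query
```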

20% Improvements Don’t solve scaling problems

Indexing

Entangling Elasticsearch with Existing Systems

Bulk Indexing 1.0

44 Days to Index all Posts (estimated)

Bulk Indexing Problems

- Overhead: Spent too much time starting indexing jobs

WordPress.com has 500 mil MySQL tables.

- High DB Load: Corner Cases. Blogs with 1+ mil followers.
- High DB Load: Indexing sequentially doesn’t spread the load.
- High DB Load: Heavy load on archive DBs.

Bulk Indexing Today

12.0?

4 Days to Index all Posts (running right now)

Real Time Indexing

The Hardest Part!

Real Time Goals

1) Eventually Consistent

2) Minimize Bulk Re-indexing

3) Normally updated < 1 minute

Bulk reindexed 3 times in 5 months: one intentional, two during system upgrades.

Stuff Fails

1) Humans

2) Hardware

3) Elasticsearch (steady improvements)

Combinations of the above.

Hardware Problems

1) Detect and Track Down Servers

2) Prioritize Queries over Indexing

3) Throttle Indexing Jobs

- any issues: block bulk changes to blogs

- >10 min: block doc updates

- >20 min: block all indexing
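
The throttling tiers above can be sketched as a small policy function. This is a sketch under assumptions: the slide does not say what the 10 and 20 minute thresholds measure, so the code takes them to be how long a problem has persisted, and the operation names are hypothetical labels.

```python
def blocked_operations(issue_minutes):
    """Return the set of indexing operations to block.

    issue_minutes: minutes the cluster has been unhealthy, or None if healthy.
    """
    if issue_minutes is None:
        return set()
    blocked = {"bulk_blog_changes"}      # any issues
    if issue_minutes > 10:
        blocked.add("doc_updates")       # >10 min
    if issue_minutes > 20:
        blocked.add("all_indexing")      # >20 min
    return blocked
```

Each tier adds to the previous one, so queries keep priority while indexing is shed in stages.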

Real Time Failures

1) Auto Retry Failed Indexing Jobs

2) Indexing Queue for Failures

3) Scrolling Queries to Find Bad Docs
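
The first two points, retrying and then queueing what still fails, might look roughly like this. It is a minimal sketch, not the system described in the talk: the in-memory `deque`, the attempt count, and the function names are all assumptions, and scrolling queries for finding bad docs are not covered.

```python
from collections import deque

failed_queue = deque()  # hypothetical stand-in for a persistent failure queue


def index_with_retry(doc, index_fn, max_attempts=3):
    """Try to index doc; after max_attempts failures, queue it for later."""
    for _attempt in range(max_attempts):
        try:
            index_fn(doc)
            return True
        except Exception:
            continue  # a real system would likely back off before retrying
    failed_queue.append(doc)
    return False


def drain_failed(index_fn):
    """Re-attempt every queued doc once; return the docs that still fail."""
    still_failing = []
    for _ in range(len(failed_queue)):
        doc = failed_queue.popleft()
        try:
            index_fn(doc)
        except Exception:
            still_failing.append(doc)
    failed_queue.extend(still_failing)
    return still_failing
```

Keeping failures in a queue is what makes the index eventually consistent: a doc that could not be written during an outage is retried once the cluster recovers.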

Cluster Restarts

Indexing across replicas is non-deterministic

Segments diverge

Slows Restart Time

Simplistic Example

[Figure: Docs indexed into a Primary and a Replica. Segments w/ identical checksums. Shard 1 merges. Only first segment is identical.]

After Bulk Index

Every segment is out of sync!
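
Why diverged segments slow a restart can be shown with a simplified model: a replica can reuse a segment only if its checksum matches the primary's, and everything else has to be copied. The function and the sample checksums below are illustrative assumptions, not Elasticsearch's recovery code.

```python
def segments_to_copy(primary, replica):
    """Names of primary segments the replica must copy during recovery.

    primary, replica: dicts mapping segment name -> checksum.
    A segment is reusable only if the replica holds the same name and checksum.
    """
    return sorted(
        name for name, checksum in primary.items()
        if replica.get(name) != checksum
    )


# The replica merged differently, so only the first segment still matches.
primary = {"_0": "aaa", "_1": "bbb", "_2": "ccc"}
replica = {"_0": "aaa", "_3": "ddd"}
diverged = segments_to_copy(primary, replica)
```

When every segment is out of sync, the list covers the whole shard, which is the slow case described above.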

Our Bulk Indexing Procedure

1) Bulk Index All Docs

2) Optimize the index

3) Rolling Restart (sync segments)

4) Future restarts will be much faster.

- Play with recovery settings

- SSDs? => use noop Linux scheduling
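
Steps 2 and 3 can be laid out as an ordered plan of REST calls. This is a sketch under assumptions: it uses the `_optimize` endpoint of Elasticsearch versions from this period (later renamed `_forcemerge`), and it wraps each node restart in the commonly used `cluster.routing.allocation.enable` toggle, which the slide does not mention. The function only builds the plan; it sends nothing.

```python
def bulk_index_followup_plan(index, nodes):
    """Return the ordered steps to run after a bulk index.

    Each step is (method, path_or_action, body). "MANUAL" marks a step
    performed outside the REST API.
    """
    steps = [("POST", f"/{index}/_optimize", None)]
    for node in nodes:
        steps += [
            # Assumption: pause shard allocation while the node is down.
            ("PUT", "/_cluster/settings",
             {"transient": {"cluster.routing.allocation.enable": "none"}}),
            ("MANUAL", f"restart {node} and wait for it to rejoin", None),
            ("PUT", "/_cluster/settings",
             {"transient": {"cluster.routing.allocation.enable": "all"}}),
        ]
    return steps
```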

Indexing

It’s all about handling Failures

Querying

Test and Iterate

Related Posts Query

Started with MoreLikeThis API.

Did not scale well enough.

MLT API

1) Get Document

2) Analyze Document

3) Search for Similar Docs

MLT API vs MLT Query

                  MLT API                MLT Query
Throughput        147 req/sec            1062 req/sec
CPU               40%                    30%
Median latency    306 ms                 49.5 ms
Processing        All processing by ES   Build query in PHP

Related Posts Relevancy

Great With Long Content

{
  "more_like_this": {
    "fields": ["mlt_content"],
    "like_text": "Scaling Elasticsearch Part 1: Overview ElasticSearch scaling Search We recently launched Related Posts across WordPress.com, so its time to pop the hood and take a look at what ended up in our engine... ",
    "percent_terms_to_match": 0.08,
    "boost_terms": 5,
    "analyzer": "en_analyzer"
  }
}
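
Building this query on the client side, rather than letting the MLT API fetch and analyze the document, could look like the sketch below. The talk builds the query in PHP; Python is used here only for illustration, and the function name is hypothetical. The field name and parameter values are copied from the query above.

```python
import json


def build_mlt_query(post_text, analyzer="en_analyzer"):
    """Build a more_like_this search body from the post's own text."""
    return {
        "query": {
            "more_like_this": {
                "fields": ["mlt_content"],
                "like_text": post_text,
                "percent_terms_to_match": 0.08,
                "boost_terms": 5,
                "analyzer": analyzer,
            }
        }
    }


# Serialized request body, ready to send as a routed search for the blog.
body = json.dumps(build_mlt_query("Scaling Elasticsearch Part 1: Overview"))
```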

MLT Query Relevancy

Use match or multi_match for short content.

[Chart: Average Related Posts CTR]
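
One way to act on this advice is to switch query type on content length. A minimal sketch, assuming a word-count cutoff: the 50-word threshold and the function name are hypothetical, and the `title` and `content` field names are assumed rather than taken from this slide.

```python
SHORT_CONTENT_WORDS = 50  # hypothetical cutoff, not a number from the talk


def related_posts_query(title, content):
    """Pick multi_match for short posts and more_like_this for long ones."""
    text = f"{title} {content}".strip()
    if len(text.split()) < SHORT_CONTENT_WORDS:
        return {"multi_match": {"query": text, "fields": ["title", "content"]}}
    return {"more_like_this": {"fields": ["mlt_content"], "like_text": text}}
```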

Language Analyzers

arabic, armenian, basque, brazilian, bulgarian, catalan, chinese, czech, danish, dutch, english, finnish, french, galician, german, greek, hindi, hungarian, indonesian, italian, japanese, korean, norwegian, persian, portuguese, romanian, russian, spanish, swedish, turkish, thai
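
Choosing an analyzer per blog language might be sketched as a lookup with a fallback. The language codes, the subset of languages shown, and the `standard` fallback are assumptions for illustration; the analyzer names come from the list above.

```python
ANALYZER_BY_LANG = {
    "ar": "arabic", "de": "german", "en": "english", "es": "spanish",
    "fr": "french", "ja": "japanese", "pt": "portuguese",
    "pt-br": "brazilian", "ru": "russian",
}


def analyzer_for(lang_code, default="standard"):
    """Return the analyzer name for a blog's language, or a fallback."""
    code = lang_code.lower()
    if code in ANALYZER_BY_LANG:
        return ANALYZER_BY_LANG[code]
    # Fall back from a regional code such as "fr-CA" to its base language.
    return ANALYZER_BY_LANG.get(code.split("-")[0], default)
```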

Related Posts Relevancy

How important is using the correct Language Analyzer?

Doubled Click Through Rate

Unfortunately

Increased Slow Queries (>1 second) by 10x

Still worth it.

Global Query Performance

search.wordpress.com

Parent-Child Filtering

Blog Doc
- public: true|false

Post Doc
- title: “...”
- content: “...”

has_parent Filter

                  With has_parent    Without has_parent
Throughput        7.6 req/sec        17.5 req/sec
CPU               75%                50%
Median latency    503 ms             207 ms

Requires more Indexing

Querying Across All Shards
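
The two variants being compared might look like the query bodies below. This is a sketch in the `filtered` query syntax of Elasticsearch 1.x: the parent type name `blog`, the searched fields, and the idea that the alternative copies `public` onto each post document are assumptions drawn from the slides, not confirmed details.

```python
def search_with_has_parent(text):
    """Filter posts at query time by a field on the parent blog document."""
    return {"query": {"filtered": {
        "query": {"multi_match": {"query": text, "fields": ["title", "content"]}},
        "filter": {"has_parent": {
            "parent_type": "blog",  # assumed type name
            "query": {"term": {"public": True}},
        }},
    }}}


def search_denormalized(text):
    """Same search, assuming `public` was copied onto every post at index time."""
    return {"query": {"filtered": {
        "query": {"multi_match": {"query": text, "fields": ["title", "content"]}},
        "filter": {"term": {"public": True}},
    }}}
```

The second form avoids the parent lookup at query time, at the cost of re-indexing a blog's posts whenever its visibility changes.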

Indexing: Optimize to Handle Failures

Querying: Test and Iterate

Open Issues

Slow Queries (> 1 second)

Getting Better. Shards are too big.

Open Issues

What does it take to scale?

3x Data

5x Queries

Open Issues

Elasticsearch for Natural Language Processing?

At Scale.

On Live Data.

http://gibrown.wordpress.com
@gregibrown

Feeling Inspired?
http://automattic.com/work-with-us/data-wrangler/
