Elasticsearch Sharding Strategy at Tubular Labs

Post on 15-Apr-2017


Elasticsearch Sharding Strategy at Tubular Labs

How we arrived at a sharding strategy

Our Elasticsearch Infrastructure?

• 3 clusters for search/aggregations

• 1 small autocomplete cluster

• 1 medium-sized cluster for internal use

• 1 Elastic Stack cluster

Our Elasticsearch Clusters

© 2016 Tubular Labs


• 2.5 billion documents

• 4TB not including replicas

• Constant indexing load with periodic spikes

• Queries range from simple search requests to heavy terms aggregations

• Not many concurrent queries, but queries can be demanding

• Cluster is very CPU heavy

• Recently migrated from Elasticsearch 1.7 to 2.3

Our Largest Cluster


• We have to reindex anyway

• Our dataset has grown substantially

• Performance wasn’t great

• We don’t want to have to reindex in the near future

Migrating to 2.x is a good time to reconsider sharding


Sharding Strategy

● How many shards should I have per index?

● How large should my shards be?

● How many shards should I have per node?

● What hardware/instance type should I use?

Sharding Questions...


• How large is your dataset?

• How fast will your dataset grow?

• What kinds of queries are you running?

• How fast will usage grow?

• When do you want to reindex next?

• I’m sure there are more...

It Depends...


How do we get answers?


Repeatable Elasticsearch Experiments

What We Want

• Repeatable

  • Others can easily run the same tests and should get about the same results

• Easily modified

• Easy to define and understand

• Easy to run

• Understandable results

Repeatable Elasticsearch Experiments:


• Benchmarking framework for Elasticsearch

• Easily define a set of repeatable tests

  • Tests are defined in JSON

• Compare different configurations

• Sets up a single node cluster for tests or targets existing (external) clusters

  • Targeting external clusters is not fully supported and you’ll get warnings telling you as much

What is Rally?


Terms

• Track - a benchmarking scenario

• Car - system (Elasticsearch) configuration for a benchmark

• Challenge - which benchmarks are run and how they are configured

• Race - an actual run of the benchmark

• Tournament - a way to analyze the impact of changes

What is Rally?


Example track config

https://gist.github.com/mdelaney/b710fb3d25fabf7818f471bd4abe70a5

How does Rally work?


Our Experiments and Results

NOTE: The following experiments are written as we would do them next time. Due to time constraints we had to do some of this in parallel. I’ll also mention where we deviated from what is in the next few slides.

• We’re still pretty new at running benchmarks with Elasticsearch, so we’re still learning the best way to do this.

• Running these tests answered a lot of questions (and raised brand new ones)

How we used this at Tubular Labs


How big should my shards be?

Determining a good shard size


The experiment

1. Obtain a realistic data set

2. Write the Rally config to:

  • Index your data (single shard)

  • Run a set of common queries

3. Run benchmark with different document counts

4. Graph the results

Determining a good shard size
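Steps 3 and 4 of the experiment above can be sketched as a small sweep harness. This is a minimal sketch and not how Rally itself works: `load_and_query` is a hypothetical callback that would index the given number of documents into a single shard and return a function that runs one query, and `fake_load_and_query` is only a stand-in so the example runs without a cluster.

```python
import statistics
import time
from typing import Callable, Dict, List


def time_query(run_query: Callable[[], None], repetitions: int = 5) -> List[float]:
    """Run one query several times and return wall-clock latencies in seconds."""
    latencies = []
    for _ in range(repetitions):
        start = time.perf_counter()
        run_query()
        latencies.append(time.perf_counter() - start)
    return latencies


def sweep_document_counts(
    doc_counts: List[int],
    load_and_query: Callable[[int], Callable[[], None]],
    repetitions: int = 5,
) -> Dict[int, float]:
    """Record the median query latency for each document count."""
    results = {}
    for count in doc_counts:
        # Hypothetical callback: index `count` documents into one shard,
        # then hand back a function that executes a single query.
        run_query = load_and_query(count)
        results[count] = statistics.median(time_query(run_query, repetitions))
    return results


def fake_load_and_query(count: int) -> Callable[[], None]:
    """Stand-in for a real benchmark: work grows with the document count."""
    def run_query() -> None:
        sum(range(count))
    return run_query


if __name__ == "__main__":
    sweep = sweep_document_counts([10_000, 100_000, 1_000_000], fake_load_and_query)
    for count, latency in sweep.items():
        print(f"{count:>10} docs: median {latency * 1000:.2f} ms")
```

The resulting mapping of document count to median latency is what would be graphed in step 4.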


The queries we used

• Query A and B:

  • Very similar but aggregate on a slightly different set of terms

  • Hits about 10% of our dataset

• Query C and D:

  • Same aggregations as queries A and B

  • Full dataset

Determining a good shard size


Our results

Determining a good shard size


We need to consider

• How fast do you need each query to be?

• How much do you expect your data set to grow before you want to look at reindexing again?

• Your use case likely will have other concerns as well

Determining a good shard size
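The two questions above (how fast a query needs to be, and how much growth to allow for) can be turned into a rough sizing calculation. This is a minimal sketch under stated assumptions: the latency measurements, the 1 second budget, and the 2x growth factor are hypothetical, the function names are ours, and the code assumes latency rises with shard size. Only the 2.5 billion document total comes from the slides.

```python
import math
from typing import Dict, Optional


def largest_shard_within_budget(
    median_latency_s: Dict[int, float], budget_s: float
) -> Optional[int]:
    """Return the largest tested docs-per-shard value whose latency fits the budget."""
    within = [docs for docs, latency in median_latency_s.items() if latency <= budget_s]
    return max(within) if within else None


def shards_needed(total_docs: int, growth_factor: float, docs_per_shard: int) -> int:
    """Primary shards needed to hold the projected dataset at the chosen shard size."""
    projected_docs = math.ceil(total_docs * growth_factor)
    return math.ceil(projected_docs / docs_per_shard)


if __name__ == "__main__":
    # Hypothetical measurements: docs per shard -> median latency in seconds.
    measured = {5_000_000: 0.4, 10_000_000: 0.9, 20_000_000: 2.1, 40_000_000: 4.8}
    shard_size = largest_shard_within_budget(measured, budget_s=1.0)
    print("docs per shard:", shard_size)
    if shard_size is not None:
        # 2.5 billion documents is from the slides; 2x growth is an assumption.
        print("primary shards:", shards_needed(2_500_000_000, 2.0, shard_size))
```

Sizing for the projected dataset rather than the current one is what postpones the next reindex.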


How many shards per node?

Determining how many shards per node


The experiment (almost the same as before)

1. Obtain a realistic dataset

2. Write the Rally config to:

  • Index your data

  • Run a set of common queries

3. Run benchmark with different shard counts

4. Graph the results

Determining how many shards per node


What we did differently this time (time constraints)

• Used the Apache HTTP Benchmark Tool with a script to run the queries.

• Our production cluster had 26 data nodes with about 200 million documents each

• Wanted to avoid expanding the cluster further if at all possible (c3.8xlarge is pricey!)

  • 10 total shards per node (about 20 million docs/shard)

  • 16 total shards per node (about 12.5 million docs/shard)

  • 32 total shards per node (about 6.25 million docs/shard)

• Tested on 3 node clusters (2 data nodes, 1 client/master)

Determining how many shards per node
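The docs-per-shard figures in the bullets above follow directly from dividing the roughly 200 million documents on each data node by the number of shards on that node. A minimal sketch of that arithmetic (the constant and function names are ours):

```python
# Figures taken from the slide: about 200 million documents per data node,
# tested with 10, 16 and 32 total shards per node.
DOCS_PER_NODE = 200_000_000
SHARDS_PER_NODE_OPTIONS = [10, 16, 32]


def docs_per_shard(docs_per_node: int, shards_per_node: int) -> float:
    """Average number of documents in each shard on a node."""
    return docs_per_node / shards_per_node


if __name__ == "__main__":
    for shards in SHARDS_PER_NODE_OPTIONS:
        per_shard = docs_per_shard(DOCS_PER_NODE, shards)
        print(f"{shards:>2} shards/node -> {per_shard / 1_000_000:.2f} million docs/shard")
```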


Our Results - Testing Number of Shards per node

[Figures: query response by shard count, (C 1) and (C 3)]


Our Results - Testing Number of Shards per node

[Figures: query response, production vs test, (C 1) and (C 3)]


Production - 26 data nodes

Test Cluster - 2 data nodes

• Significant performance drop at each level of testing. Why?

  • A single shard on a single node performed much better than our multiple shards per node tests

  • The fully loaded 3 node cluster performed much better than our full cluster in production

• Impact of moving to a machine with more memory

  • Will the extra file system cache make a large difference?

New Questions Raised


Query load isn’t evenly distributed

Current path of performance investigation


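[Diagram: shards 1 to 15, each appearing twice, laid out in three groups of ten (apparently one group per node); entries marked * as on the original slide]

Group 1: 1, 4, 3*, 2*, 5*, 8*, 10, 13*, 11, 6*

Group 2: 2, 5, 7*, 4*, 10*, 9*, 11*, 12*, 14, 15

Group 3: 3, 6, 1*, 9, 13, 8, 12, 7, 15*, 14*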
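The shard layout listed above can be tallied to see how the marked entries are spread. This is a minimal sketch that assumes the entries form three groups of ten, one group per node. The slides do not say what the asterisk denotes, so the code only counts marked entries and does not interpret them.

```python
from collections import Counter
from typing import List

# Entries transcribed from the slide, split into three groups of ten.
# Treating each group as one node is an assumption.
GROUPS: List[List[str]] = [
    ["1", "4", "3*", "2*", "5*", "8*", "10", "13*", "11", "6*"],
    ["2", "5", "7*", "4*", "10*", "9*", "11*", "12*", "14", "15"],
    ["3", "6", "1*", "9", "13", "8", "12", "7", "15*", "14*"],
]


def marked_per_group(groups: List[List[str]]) -> List[int]:
    """Count the entries carrying an asterisk in each group."""
    return [sum(1 for entry in group if entry.endswith("*")) for group in groups]


def copies_per_shard(groups: List[List[str]]) -> Counter:
    """Count how many times each shard number appears across all groups."""
    return Counter(entry.rstrip("*") for group in groups for entry in group)


if __name__ == "__main__":
    print("marked entries per group:", marked_per_group(GROUPS))
    print("copies per shard:", sorted(set(copies_per_shard(GROUPS).values())))
```

Every shard number appears exactly twice, while the marked entries split 6, 6 and 3 across the groups, which is in line with the slide's point that load is not evenly distributed.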

Problems We Encountered

Rally related

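• With nested documents, the document count in track.json does not match the document count Rally checks at the end of indexing (Elasticsearch stores each nested object as a separate Lucene document, which likely explains the mismatch).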

• Multi node support not yet available

Problems We Encountered


Non Rally related

• Performance in reality wasn’t as good as our testing suggested it should be

  • We haven’t found the reason for this yet

• We’ve noticed a correlation between the number of shards a query hits per node and the time taken to run the query on the shard but have not yet identified the bottleneck.

• We were able to mitigate this by adding additional data nodes

Problems We Encountered
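One way to quantify the correlation described above is a Pearson coefficient between the number of shards a query hits on a node and the per-shard query time. This is a minimal sketch of that calculation and not the analysis the authors ran; the sample values are made up purely for illustration.

```python
import math
from typing import Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equally long samples."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two samples of equal length with at least 2 points")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined for a constant sample")
    return covariance / (spread_x * spread_y)


if __name__ == "__main__":
    # Hypothetical measurements, not data from the talk.
    shards_hit_per_node = [1, 2, 3, 5, 6, 8]
    per_shard_time_ms = [110, 150, 170, 260, 300, 410]
    print(f"r = {pearson(shards_hit_per_node, per_shard_time_ms):.3f}")
```

A coefficient near 1 would support the observation, though it would not by itself identify the bottleneck.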


Thank You! Questions?