Elasticsearch: first steps with an aggregate-oriented database
Jug Roma 28/11/2013
Matteo Moci
Me
Matteo Moci
@matteomoci
http://mox.fm
Software Engineer
R&D, new product development
Agenda
• 2 Use cases
• Elasticsearch Basics
• Data Design for scaling
Social Media Analytics Platform
for Marketing Agencies
Scenario
• Using Elasticsearch as:
  • Analytics engine
  • Aggregate repository
Use case 1
• Count the distribution of values over time
Before
• ~10M documents
• Heaviest query:
  • ~10 minutes
• Our staff had a problem
After
• ~10M documents
• Heaviest query:
  • ~1 second (even with a larger dataset)
Use case 2
• Aggregate-oriented repository
• ...as in DDD
http://ptgmedia.pearsoncmg.com/images/chap10_9780321834577/elementLinks/10fig05.jpg
Elasticsearch
Distributed RESTful search and analytics
real time data and analytics
distributed
high availability
multi tenancy
full-text search
schema free
RESTful, JSON API
Elasticsearch basics
• Install
• API
• Types mapping
• Facets
• Relations
Install
$ wget https://download.elasticsearch.org/...
$ tar -xf elasticsearch-0.90.7.tar.gz
Run!
$ ./elasticsearch-0.90.7/bin/elasticsearch -f
[Diagram: cluster "es" with one node, Hulk]
$ ./elasticsearch-0.90.7/bin/elasticsearch -f
[Diagram: cluster "es" with two nodes, Hulk and Thor]
Index a document
$ curl -XPUT 'localhost:9200/products/product/1' -d '{ "name": "Camera" }'
Search
$ curl -XGET 'localhost:9200/products/product/_search?q=Camera'
Shards and Replicas
[Diagram sequence: cluster "es", first with node Hulk alone, then with nodes Hulk and Thor; the Products index is shown with shards 1 and 2 in two rows (1 2 / 1 2), spread across the nodes once Thor joins]
Integration
[Diagram: a TransportClient connects to nodes Hulk and Thor, each on port 9300]
Async Java API
// Asynchronous, non-blocking API: a listener handles the result.
this.client.prepareGet("documents", "document", id)
    .execute(new ActionListener<GetResponse>() {
        @Override
        public void onResponse(GetResponse response) {
            // handle the retrieved document
        }

        @Override
        public void onFailure(Throwable e) {
            // handle the failure
        }
    });
Mapping
Mappings define how primitive types are stored and analyzed
Mapping
• JSON data is parsed on indexing
• Mapping is done when a field is first indexed
• Inferred if not configured (!)
• Types: float, long, boolean, date (+formatting), object, nested
• String type can have arbitrary analyzers
• A field can be split into multiple fields
{
  "text": {
    "type": "multi_field",
    "fields": {
      "text": {
        "type": "string",
        "index": "analyzed",
        "index_analyzer": "whitespace",
        "analyzer": "whitespace"
      },
      "text_bigram": {
        "type": "string",
        "index": "analyzed",
        "index_analyzer": "bigram_analyzer",
        "search_analyzer": "bigram_analyzer"
      },
      "text_trigram": {
        "type": "string",
        "index": "analyzed",
        "index_analyzer": "trigram_analyzer",
        "search_analyzer": "trigram_analyzer"
      }
    }
  }
}
Mapping - lessons
• schema can evolve (e.g. add fields)
• inferred if not specified (!)
• worst case: reindex
• use aliases to enable zero downtime
Search with Facets
// Terms facet on the user id, limited to the top maxUsersAmount terms.
final TermsFacetBuilder userFacet = FacetBuilders.termsFacet(MENTION_FACET_NAME)
    .field(USER_ID)
    .size(maxUsersAmount);

// Only the facet is needed, so request no hits (size 0, COUNT search type).
SearchResponse response = client.prepareSearch(Indices.USERS)
    .setTypes(USER_TYPE)
    .setQuery(someQuery)
    .setSize(0)
    .setSearchType(SearchType.COUNT)
    .addFacet(userFacet)
    .execute()
    .actionGet();

final TermsFacet facets = (TermsFacet) response.getFacets()
    .facetsAsMap()
    .get(MENTION_FACET_NAME);
Query
Facets
Date Histogram Facet
The histogram facet works with numeric data by building a histogram across intervals of the field values.
Each value is placed in a “bucket”.
The date histogram facet applies the same idea to date fields, with intervals such as day or month.
{
  "query": { "match_all": {} },
  "facets": {
    "histo1": {
      "histogram": {
        "field": "followers",
        "interval": 10
      }
    }
  }
}
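The bucketing described above can be sketched in plain Java. This is only an illustration of the idea (round each value down to a multiple of the interval and count per bucket), not Elasticsearch's implementation; the class and method names are hypothetical, and the sample follower counts are made up.

```java
import java.util.Map;
import java.util.TreeMap;

class HistogramSketch {
    // Bucket key: the value rounded down to a multiple of the interval.
    static long bucketKey(long value, long interval) {
        return Math.floorDiv(value, interval) * interval;
    }

    // Count how many values fall into each bucket, ordered by bucket key.
    static Map<Long, Integer> histogram(long[] values, long interval) {
        Map<Long, Integer> counts = new TreeMap<>();
        for (long value : values) {
            counts.merge(bucketKey(value, interval), 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        long[] followers = {3, 7, 12, 18, 25, 104}; // made-up sample data
        System.out.println(histogram(followers, 10)); // {0=2, 10=2, 20=1, 100=1}
    }
}
```

With an interval of 10, as in the request above, the values 12 and 18 share the bucket keyed 10.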
Facets - lessons
Bug in 0.90.x:
• https://github.com/elasticsearch/elasticsearch/issues/1305 *
Solutions:
• use 1 shard
• ask for the top 100 instead of the top 10
* will be solved in 1.0 with the aggregation module
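Why both workarounds help can be shown with a small simulation. It assumes the problem referred to is the usual top-N truncation: each shard returns only its local top terms, and the coordinating node sums what it receives. The class, method names and counts below are hypothetical and do not reproduce Elasticsearch's code.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class TopTermsSketch {
    // Highest count first; ties broken by term so the result is deterministic.
    static final Comparator<Map.Entry<String, Integer>> BY_COUNT_DESC = (a, b) -> {
        int byCount = Integer.compare(b.getValue(), a.getValue());
        return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
    };

    // The `size` most frequent terms of one count table.
    static Map<String, Integer> top(Map<String, Integer> counts, int size) {
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(counts.entrySet());
        entries.sort(BY_COUNT_DESC);
        Map<String, Integer> result = new LinkedHashMap<>();
        for (Map.Entry<String, Integer> e : entries.subList(0, Math.min(size, entries.size()))) {
            result.put(e.getKey(), e.getValue());
        }
        return result;
    }

    // Each shard reports only its local top `shardSize` terms; the partial
    // counts are summed and the global top `size` is taken from the sums.
    static Map<String, Integer> mergedTop(List<Map<String, Integer>> shards, int shardSize, int size) {
        Map<String, Integer> totals = new HashMap<>();
        for (Map<String, Integer> shard : shards) {
            top(shard, shardSize).forEach((term, count) -> totals.merge(term, count, Integer::sum));
        }
        return top(totals, size);
    }

    public static void main(String[] args) {
        // Term "b" is second on both shards but first overall (9 + 9 = 18).
        List<Map<String, Integer>> shards = List.of(
            Map.of("a", 10, "b", 9),
            Map.of("c", 10, "b", 9));
        System.out.println(mergedTop(shards, 1, 1)); // {a=10}: "b" is missed
        System.out.println(mergedTop(shards, 2, 1)); // {b=18}: correct
    }
}
```

With a single shard there is nothing to merge, and asking each shard for more terms than needed makes it less likely that a globally frequent term is cut off locally.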
Analyzers
A Lucene analyzer consists of a tokenizer and an arbitrary number of token filters (plus optional char filters)
{
  "index": {
    "analysis": {
      "filter": {
        "bigram_shingle_filter": {
          "type": "shingle",
          "max_shingle_size": 2,
          "min_shingle_size": 2,
          "output_unigrams": "false",
          "output_unigrams_if_no_shingles": "false"
        },
        "trigram_shingle_filter": {
          "type": "shingle",
          "max_shingle_size": 3,
          "min_shingle_size": 3,
          "output_unigrams": "false",
          "output_unigrams_if_no_shingles": "false"
        }
      },
      "analyzer": {
        "bigram_analyzer": {
          "tokenizer": "whitespace",
          "filter": [ "standard", "bigram_shingle_filter" ]
        },
        "trigram_analyzer": {
          "tokenizer": "whitespace",
          "filter": [ "standard", "trigram_shingle_filter" ]
        }
      }
    }
  }
}
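To make the effect of these analyzers concrete, here is a rough Java sketch of what a whitespace tokenizer followed by a fixed-size shingle filter emits when unigrams are switched off. It ignores the "standard" filter and is not Lucene's implementation; the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class ShingleSketch {
    // Split on whitespace, then emit every run of `size` adjacent tokens.
    // Inputs with fewer than `size` tokens yield nothing, which mirrors
    // output_unigrams and output_unigrams_if_no_shingles both being false.
    static List<String> shingles(String text, int size) {
        String[] tokens = text.trim().split("\\s+");
        List<String> result = new ArrayList<>();
        for (int i = 0; i + size <= tokens.length; i++) {
            result.add(String.join(" ", Arrays.copyOfRange(tokens, i, i + size)));
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(shingles("the quick brown fox", 2)); // [the quick, quick brown, brown fox]
        System.out.println(shingles("the quick brown fox", 3)); // [the quick brown, quick brown fox]
    }
}
```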
Relations between Documents
[Diagram: Book N : 1 Author]
• nested: faster reads, update needs reindex, cross object match
• parent/child: same shard, no reindex on update, difficult sorting
Nested Documents
Declare the books field as type “nested” in the Author mapping
We can then query Authors with a query on the properties of their nested Books
“Authors who published at least one book with Penguin, in the scifi genre”
curl -XGET 'localhost:9200/authors/nested_author/_search' -d '{
  "query": {
    "filtered": {
      "query": { "match_all": {} },
      "filter": {
        "nested": {
          "path": "books",
          "query": {
            "filtered": {
              "query": { "match_all": {} },
              "filter": {
                "and": [
                  { "term": { "books.publisher": "penguin" } },
                  { "term": { "books.genre": "scifi" } }
                ]
              }
            }
          }
        }
      }
    }
  }
}'
Parent and Child
Parents and children are indexed separately
Specify the _parent type in the Child mapping (Book)
When indexing Books, specify the id of the Author
curl -XPOST 'localhost:9200/authors/book/_mapping' -d '{
  "book": {
    "_parent": { "type": "bare_author" }
  }
}'
curl -XPOST 'localhost:9200/authors/book/1?parent=2' -d '{
  "name": "Revelation Space",
  "genre": "scifi",
  "publisher": "penguin"
}'
Parent and Child - query
curl -XPOST 'localhost:9200/authors/bare_author/_search' -d '{
  "query": {
    "has_child": {
      "type": "book",
      "query": {
        "filtered": {
          "query": { "match_all": {} },
          "filter": {
            "and": [
              { "term": { "publisher": "penguin" } },
              { "term": { "genre": "scifi" } }
            ]
          }
        }
      }
    }
  }
}'
Data Design
Index Configurations
• One index “per user”
• Single index
• SI + Routing: 1 index + custom doc routing to shards
• Time: 1 index per time window *
* we can search across indices
One Index per user
[Diagram: nodes Hulk and Thor holding shards User1 s0, User1 s1 and User2 s0]
+ different sharding per user
- small users own (and cost) at least 1 shard
Single Index
[Diagram: nodes Hulk and Thor holding shards Users s0, Users s2 and Users s3]
+ filter by user id, support growth
- search hits all shards
Single Index + routing
[Diagram: nodes Hulk and Thor holding shards Users s0, Users s2 and Users s3]
+ a user’s data is all in one shard, allows large overallocation
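Custom routing boils down to picking the shard from a hash of the routing value (here, the user id) modulo the number of primary shards. The sketch below uses `String.hashCode()` as a stand-in hash, which is an assumption and not the function Elasticsearch uses; the class and method names are hypothetical.

```java
class RoutingSketch {
    // Shard number for a routing value: hash modulo the number of primary shards.
    // String.hashCode() is only a stand-in for the real hash function.
    static int shardFor(String routing, int numberOfShards) {
        return Math.floorMod(routing.hashCode(), numberOfShards);
    }

    public static void main(String[] args) {
        int shards = 4;
        // Every document routed by the same user id lands on the same shard,
        // so a search routed by that id only needs to hit one shard.
        System.out.println(shardFor("user1", shards) == shardFor("user1", shards)); // true
        System.out.println("user2 -> shard " + shardFor("user2", shards));
    }
}
```

Because the shard count is part of the formula, changing it would move existing documents to different shards. That is one way to see why the number of primary shards is fixed at index creation and why overallocating shards up front leaves room to grow.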
Index per time range
[Diagram: nodes Hulk and Thor holding shards 2013_01 s1, 2013_01 s2 and 2013_02 s1]
+ allows changes in future indices
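With one index per time window, a search over a date range has to name every index it spans; Elasticsearch accepts a comma-separated list of index names in the search URL. The sketch below builds such a list for monthly indices, assuming the `yyyy_MM` naming seen in the diagram; the class and method names are hypothetical.

```java
import java.time.YearMonth;
import java.util.ArrayList;
import java.util.List;

class TimeIndexSketch {
    // Index name for one month, e.g. 2013_01 (naming assumed from the diagram).
    static String indexName(YearMonth month) {
        return String.format("%04d_%02d", month.getYear(), month.getMonthValue());
    }

    // Comma-separated index names covering every month from `from` to `to`, inclusive.
    static String indicesBetween(YearMonth from, YearMonth to) {
        List<String> names = new ArrayList<>();
        for (YearMonth month = from; !month.isAfter(to); month = month.plusMonths(1)) {
            names.add(indexName(month));
        }
        return String.join(",", names);
    }

    public static void main(String[] args) {
        String indices = indicesBetween(YearMonth.of(2012, 12), YearMonth.of(2013, 2));
        System.out.println(indices); // 2012_12,2013_01,2013_02
        System.out.println("localhost:9200/" + indices + "/_search");
    }
}
```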
Data Design - lessons
Test, test, test your use case!
Take a single node with one shard and throw load at it to measure the shard's capacity
The shard is the scaling unit: overallocate to enable future scaling
#shards > #nodes
...ES has lots of other features!
• Bulk operations
• Percolator (alerts, classification, …)
• Suggesters (“Did you mean …?”)
• Index templates (automatic index configuration)
• Monitoring API (amount of memory used, number of operations, …)
• Plugins
• ...