hippo meetup: enterprise search with solr and elasticsearch
DESCRIPTION
Presentation used at the Hippo meetup about enterprise search, which took place in Amsterdam. The talk started with a general introduction to search with Lucene, then covered scaling with Solr and the distributed problems that elasticsearch successfully addresses.
TRANSCRIPT
15th January 2013 – Hippo meetup
[email protected] - @lucacavanna
Luca Cavanna
Software developer & Search consultant at Trifork Amsterdam
Trifork (aka Jteam/Dutchworks/Orange11)
Focus areas:
– Big data & Search
– Mobile
– Custom solutions
– Knowledge (GOTO Amsterdam)
● Hippo partner
● Hippo related search projects:
– uva.nl
– working on rijksoverheid.nl
Agenda
● Search introduction
– Lucene foundation
– Why do we need Solr or elasticsearch?
● Scaling with Solr
● Elasticsearch distributed nature
● Elasticsearch features
Apache Lucene
● High-performance, full-featured text search engine library written entirely in Java
● It indexes documents as collections of fields
● A field is a string based key-value pair
● What data structure does it use under the hood?
Inverted index
1 The old night keeper keeps the keep in the town
2 In the big old house in the big old gown.
3 The house in the town had the big old keep
4 Where the old night keeper never did sleep.
5 The night keeper keeps the keep in the night
6 And keeps in the dark and sleeps in the light.
term     freq   posting list
and      1      6
big      2      2 3
dark     1      6
did      1      4
gown     1      2
had      1      3
house    2      2 3
in       5      1 2 3 5 6
keep     3      1 3 5
keeper   3      1 4 5
keeps    3      1 5 6
light    1      6
never    1      4
night    3      1 4 5
old      4      1 2 3 4
sleep    1      4
sleeps   1      6
the      6      1 2 3 4 5 6
town     2      1 3
where    1      4
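The table above can be reproduced with a few lines of plain Java. This is a minimal sketch, not how Lucene stores its index: the class name `InvertedIndexSketch` and its methods are made up for illustration, and the analysis step is a naive lowercase-and-split.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical toy class, only meant to mirror the table on the slide.
class InvertedIndexSketch {
    // term -> sorted ids of the documents that contain it (the posting list)
    private final Map<String, TreeSet<Integer>> postings = new TreeMap<>();

    void add(int docId, String text) {
        // Naive analysis: lowercase, then split on anything that is not a-z.
        for (String token : text.toLowerCase(Locale.ROOT).split("[^a-z]+")) {
            if (!token.isEmpty()) {
                postings.computeIfAbsent(token, t -> new TreeSet<>()).add(docId);
            }
        }
    }

    List<Integer> postingList(String term) {
        TreeSet<Integer> docs = postings.get(term);
        return docs == null ? new ArrayList<Integer>() : new ArrayList<Integer>(docs);
    }

    // Number of documents that contain the term (the "freq" column).
    int docFreq(String term) {
        return postingList(term).size();
    }

    // Builds the index for the six lines shown on the slide, numbered from 1.
    static InvertedIndexSketch sample() {
        String[] lines = {
            "The old night keeper keeps the keep in the town",
            "In the big old house in the big old gown.",
            "The house in the town had the big old keep",
            "Where the old night keeper never did sleep.",
            "The night keeper keeps the keep in the night",
            "And keeps in the dark and sleeps in the light."
        };
        InvertedIndexSketch index = new InvertedIndexSketch();
        for (int i = 0; i < lines.length; i++) {
            index.add(i + 1, lines[i]);
        }
        return index;
    }

    public static void main(String[] args) {
        InvertedIndexSketch index = sample();
        for (String term : new String[] {"big", "gown", "keeper"}) {
            System.out.println(term + " " + index.docFreq(term) + " " + index.postingList(term));
        }
    }
}
```

Looking up a term is a single map access, which is why search over an inverted index stays fast as the number of documents grows.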
Inverted index
● Indexing
– Text analysis
● Tokenization, lowercasing and more
● The inverted index can contain more data
– Term offsets and more
● The inverted index itself doesn't contain the text for displaying the search results
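To make the analysis step concrete, here is a small sketch of tokenization plus lowercasing that also records each token's position and character offsets. It uses only the JDK; `AnalysisSketch` and `Token` are hypothetical names, and Lucene's real analyzers are far more capable.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical toy analyzer: splits on non-alphanumerics and lowercases.
class AnalysisSketch {
    static final class Token {
        final String term;      // normalized text that would go into the index
        final int position;     // 0-based position of the token in the field
        final int startOffset;  // character offsets into the original text
        final int endOffset;

        Token(String term, int position, int startOffset, int endOffset) {
            this.term = term;
            this.position = position;
            this.startOffset = startOffset;
            this.endOffset = endOffset;
        }
    }

    private static final Pattern WORD = Pattern.compile("[\\p{L}\\p{N}]+");

    static List<Token> analyze(String text) {
        List<Token> tokens = new ArrayList<>();
        Matcher matcher = WORD.matcher(text);
        int position = 0;
        while (matcher.find()) {
            String term = matcher.group().toLowerCase(Locale.ROOT);
            tokens.add(new Token(term, position++, matcher.start(), matcher.end()));
        }
        return tokens;
    }

    public static void main(String[] args) {
        for (Token t : analyze("The Old Keeper")) {
            System.out.println(t.term + " pos=" + t.position
                    + " offsets=" + t.startOffset + "-" + t.endOffset);
        }
    }
}
```

Offsets point back into the original text, which is what a feature such as highlighting needs, since the index itself only holds the normalized terms.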
Indexing
● Lucene writes indexes as segments
● Segments are not modifiable: Write-Once
● Each segment is a searchable mini index
● Each segment contains
– Inverted index
– Stored fields
– ...and more
Indexing: the commit operation
● Documents are searchable only after a commit!
● A commit also provides durability
● The most expensive operation in Lucene!!!
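The interplay between write-once segments and commit can be modelled in a few lines. This is a toy model under loud assumptions: `SegmentSketch` is a hypothetical class, a "segment" is just an unmodifiable list of strings, and search is a substring match. It only shows that buffered documents stay invisible until a commit turns them into a new segment.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Locale;

// Hypothetical toy model of segments and commit, not Lucene's implementation.
class SegmentSketch {
    // Committed segments: each one is write-once and searchable on its own.
    private final List<List<String>> segments = new ArrayList<>();
    // Documents added since the last commit: not searchable yet.
    private List<String> buffer = new ArrayList<>();

    void addDocument(String doc) {
        buffer.add(doc);
    }

    // Turns the buffered documents into a new immutable segment.
    void commit() {
        if (!buffer.isEmpty()) {
            segments.add(Collections.unmodifiableList(buffer));
            buffer = new ArrayList<>();
        }
    }

    // Searches every committed segment and concatenates the hits.
    List<String> search(String word) {
        String needle = word.toLowerCase(Locale.ROOT);
        List<String> hits = new ArrayList<>();
        for (List<String> segment : segments) {
            for (String doc : segment) {
                if (doc.toLowerCase(Locale.ROOT).contains(needle)) {
                    hits.add(doc);
                }
            }
        }
        return hits;
    }

    int segmentCount() {
        return segments.size();
    }

    public static void main(String[] args) {
        SegmentSketch index = new SegmentSketch();
        index.addDocument("The old night keeper");
        System.out.println("before commit: " + index.search("keeper"));
        index.commit();
        System.out.println("after commit:  " + index.search("keeper"));
    }
}
```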
Near-real-time search (since Lucene 2.9, exposed in Solr 4.0)
● With the Lucene near-real time API you don't need a commit to make new documents searchable
● Less expensive than commit
● Doesn't guarantee durability though
● Exposed as soft commit in Solr 4.0
Lucene code example – indexing data
// Open (or create) the index in the "data" directory
IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_40,
        new StandardAnalyzer(Version.LUCENE_40));
Directory directory = FSDirectory.open(new File("data"));
IndexWriter writer = new IndexWriter(directory, config);

Document document = new Document();

// id: indexed and stored, but not tokenized (exact match only)
FieldType idFieldType = new FieldType();
idFieldType.setIndexed(true);
idFieldType.setStored(true);
idFieldType.setTokenized(false);
document.add(new Field("id", "id-1", idFieldType));

// title: indexed and stored, so it can be searched and displayed
FieldType titleFieldType = new FieldType();
titleFieldType.setIndexed(true);
titleFieldType.setStored(true);
document.add(new Field("title", "This is the title", titleFieldType));

// description: indexed only, searchable but not retrievable
FieldType descriptionFieldType = new FieldType();
descriptionFieldType.setIndexed(true);
document.add(new Field("description", "This is the description", descriptionFieldType));

writer.addDocument(document);
writer.close(); // close() also commits the pending changes
Lucene code example – querying and showing results
// Parse the user query (queryAsString) against the "title" field by default
QueryParser queryParser = new QueryParser(Version.LUCENE_40, "title",
        new StandardAnalyzer(Version.LUCENE_40));
Query query = queryParser.parse(queryAsString);

Directory directory = FSDirectory.open(new File("data"));
IndexReader indexReader = DirectoryReader.open(directory);
IndexSearcher indexSearcher = new IndexSearcher(indexReader);

// Retrieve the 10 best matching documents
TopDocs topDocs = indexSearcher.search(query, 10);
System.out.println("Total hits: " + topDocs.totalHits);
for (ScoreDoc hit : topDocs.scoreDocs) {
    // Load the stored fields of each hit
    Document document = indexSearcher.doc(hit.doc);
    for (IndexableField field : document) {
        System.out.println(field.name() + ": " + field.stringValue());
    }
}
indexReader.close();
What's missing?
● A common way to represent documents
● Interface to send documents to (HTTP)
● A way to represent queries
● Interface to send queries to (HTTP)
● Configuration
● Caching
● Distributed infrastructure
● And more....
Enterprise search servers
Scaling – why?
‣ The more concurrent searches you run, the slower they get
‣ Indexing and searching on the same machine will substantially harm search performance
‣ Segment merging can be CPU- and I/O-intensive
‣ Disk cache invalidation
‣ Failover
Solr replication example
Solr replication (pull approach)
• Master-slave based solution
• Single machine for indexing data (master)
• Multiple machines for querying (slaves)
• Master is not aware of the slaves
• Slave is aware of the master
• Load balancer responsible for balancing the query requests
• What about real-time search? No way!
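The pull approach can be sketched as a slave that periodically asks the master whether a newer index exists. This is a simplified model, not Solr's replication handler: `Master`, `Slave`, the version counter and `poll()` are all assumptions made for illustration. Note that the master holds no reference to any slave.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical toy model of pull-based master-slave replication.
class PullReplicationSketch {
    static class Master {
        private final List<String> docs = new ArrayList<>();
        private long version = 0;

        // All indexing goes to the master; each change bumps the version.
        void index(String doc) {
            docs.add(doc);
            version++;
        }

        long version() {
            return version;
        }

        List<String> snapshot() {
            return new ArrayList<>(docs);
        }
    }

    static class Slave {
        private final Master master; // the slave knows the master, not the other way round
        private List<String> docs = new ArrayList<>();
        private long version = 0;

        Slave(Master master) {
            this.master = master;
        }

        // Called on a schedule; returns true if a newer index was pulled.
        boolean poll() {
            if (master.version() > version) {
                docs = master.snapshot();
                version = master.version();
                return true;
            }
            return false;
        }

        boolean contains(String doc) {
            return docs.contains(doc);
        }
    }

    public static void main(String[] args) {
        Master master = new Master();
        Slave slave = new Slave(master);
        master.index("doc-1");
        System.out.println("before poll: " + slave.contains("doc-1"));
        slave.poll();
        System.out.println("after poll:  " + slave.contains("doc-1"));
    }
}
```

The gap between `index` on the master and the next `poll` on the slave is the reason this setup cannot offer real-time search.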
SolrCloud
• A set of new distributed capabilities in Solr
• Uses Apache ZooKeeper as a system of record for the cluster state, for central configuration, and for leader election
• Whatever server (shard) you send data to:
• the documents get distributed over the shards
• A shard can be a leader or a replica and contains a subset of the data
• Scale out easily by adding new Solr nodes
elasticsearch
● Distributed search engine built on top of Lucene
● Apache 2 license
● Written in Java
● RESTful
● Created and mainly developed by Shay Banon
● A company behind it: elasticsearch.com
● Regular releases
– Latest release 0.20.2
elasticsearch
● Schemaless
– Uses defaults and automatic type guessing
– Custom mappings may be defined if needed
● JSON oriented
● Multi tenancy
– Multiple indexes per node, multiple types per index
● Designed to be distributed from the beginning
● Almost everything is available as API (including configuration)
● Wide range of administration APIs
elasticsearch distributed terminology
● Node: a running instance of elasticsearch which belongs to a cluster (usually one node per server)
● Cluster: one or more nodes with the same cluster name
● Shard: a single Lucene instance. A low-level worker unit managed by elasticsearch. An index is split into one or more shards.
● Index: a logical namespace which points to one or more shards
– Your code won't deal directly with a shard, only with an index
– But an index is composed of multiple Lucene indexes (one per shard)
elasticsearch distributed terminology
● More shards:
– improve indexing performance
– increase data distribution (depends on # of nodes)
– Watch out: each shard has a cost as well!
● More replicas:
– improve failover
– improve querying performance
Transaction Log
• Indexed docs are fully persistent
• No need for a Lucene IndexWriter#commit
• Managed using a transaction log / WAL
• Full single node durability (survives kill -9)
• Utilized when doing hot relocation of shards
• Periodically “flushed” (calling IW#commit)
• Durability and real time search together!
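The idea behind the transaction log can be shown with a toy write-ahead log. This is a sketch under stated assumptions, not elasticsearch's implementation: `TranslogSketch` is a hypothetical class, in-memory lists stand in for the on-disk log and the committed Lucene segments, and `crash()` simply drops whatever was only held in memory.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical toy model of a write-ahead log in front of an index.
class TranslogSketch {
    private final List<String> committed = new ArrayList<>();   // stands in for committed segments
    private final List<String> translog = new ArrayList<>();    // stands in for the on-disk log
    private final List<String> uncommitted = new ArrayList<>(); // memory only, lost on a crash

    void index(String doc) {
        translog.add(doc);    // logged first, so the operation survives a crash
        uncommitted.add(doc); // searchable without a commit
    }

    // The periodic "flush": commit the index, then start a fresh log.
    void flush() {
        committed.addAll(uncommitted);
        uncommitted.clear();
        translog.clear();
    }

    // Simulates kill -9: everything that was only in memory is gone.
    void crash() {
        uncommitted.clear();
    }

    // On restart, replay the operations logged since the last flush.
    void recover() {
        uncommitted.addAll(translog);
    }

    List<String> visibleDocs() {
        List<String> all = new ArrayList<>(committed);
        all.addAll(uncommitted);
        return all;
    }

    public static void main(String[] args) {
        TranslogSketch shard = new TranslogSketch();
        shard.index("a");
        shard.flush();
        shard.index("b");
        shard.crash();
        shard.recover();
        System.out.println(shard.visibleDocs());
    }
}
```

Appending to a log is cheap compared to a Lucene commit, which is how the expensive commit can be deferred to the periodic flush without giving up durability.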
Index - Shards & Replicas
Node 1: (empty)
Node 2: (empty)
Client
curl -XPUT localhost:9200/hippo -d '{ "index" : { "number_of_shards" : 2, "number_of_replicas" : 1 }}'
Index - Shards & Replicas
Node 1: Shard 0 (primary), Shard 1 (replica)
Node 2: Shard 0 (replica), Shard 1 (primary)
Client
curl -XPUT localhost:9200/hippo -d '{ "index" : { "number_of_shards" : 2, "number_of_replicas" : 1 }}'
Indexing - 1
Node 1: Shard 0 (primary), Shard 1 (replica)
Node 2: Shard 0 (replica), Shard 1 (primary)
Client
• Automatic sharding, push replication
curl -XPUT localhost:9200/hippo/users/1 -d '{ "name" : { "first" : "Luca", "last" : "Cavanna" }}'
Indexing - 2
Node 1: Shard 0 (primary), Shard 1 (replica)
Node 2: Shard 0 (replica), Shard 1 (primary)
Client
curl -XPUT localhost:9200/hippo/users/2 -d '{ "name" : { "first" : "Jeroen", "last" : "Reijn" }}'
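Automatic sharding boils down to mapping each document id to one of the primary shards, typically by hashing the id and taking it modulo the number of shards. The sketch below illustrates that principle only: `ShardRoutingSketch` is a hypothetical class, and using `String.hashCode()` is an assumption for the example, not the hash function elasticsearch uses.

```java
// Hypothetical illustration of hash-based document routing.
class ShardRoutingSketch {
    // Picks the shard for a document id. String.hashCode() is an assumption
    // made for this example; the real hash function is not shown here.
    static int shardFor(String id, int numberOfShards) {
        // floorMod keeps the result in [0, numberOfShards) even for negative hashes
        return Math.floorMod(id.hashCode(), numberOfShards);
    }

    public static void main(String[] args) {
        int numberOfShards = 2; // as in the "hippo" index created earlier
        for (String id : new String[] {"1", "2"}) {
            System.out.println("users/" + id + " -> shard " + shardFor(id, numberOfShards));
        }
    }
}
```

Because the target shard depends only on the id and the shard count, any node can work out where a document lives without asking a central directory.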
Search - 1
Node 1: Shard 0 (primary), Shard 1 (replica)
Node 2: Shard 0 (replica), Shard 1 (primary)
Client
curl -XGET 'localhost:9200/hippo/_search?q=luca'
• Scatter / Gather search
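Scatter/gather means the query is sent to one copy of every shard, each shard returns its own best hits, and the receiving node merges them into a global result. A minimal sketch of the gather step follows; `ScatterGatherSketch`, `Hit` and the scores in `main` are made up for illustration, and scoring details are ignored.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Hypothetical sketch of merging per-shard results into a global top-k.
class ScatterGatherSketch {
    static final class Hit {
        final String id;
        final double score;

        Hit(String id, double score) {
            this.id = id;
            this.score = score;
        }
    }

    // Gather step: combine the hits returned by each shard, keep the k best.
    static List<Hit> gather(List<List<Hit>> perShardHits, int k) {
        List<Hit> all = new ArrayList<>();
        for (List<Hit> shardHits : perShardHits) {
            all.addAll(shardHits);
        }
        // Highest score first
        all.sort(Comparator.comparingDouble((Hit h) -> h.score).reversed());
        return new ArrayList<>(all.subList(0, Math.min(k, all.size())));
    }

    public static void main(String[] args) {
        // Made-up results from two shards
        List<Hit> shard0 = Arrays.asList(new Hit("1", 0.9), new Hit("3", 0.2));
        List<Hit> shard1 = Arrays.asList(new Hit("2", 0.5));
        for (Hit hit : gather(Arrays.asList(shard0, shard1), 2)) {
            System.out.println(hit.id + " " + hit.score);
        }
    }
}
```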
Search - 2
Node 1: Shard 0 (primary), Shard 1 (replica)
Node 2: Shard 0 (replica), Shard 1 (primary)
Client
curl -XGET 'localhost:9200/hippo/_search?q=luca'
• Automatic balancing between replicas
Search - 3
Node 1: Shard 0 (primary), Shard 1 (replica)
Node 2: Shard 0 (replica), Shard 1 (primary)
Client
curl -XGET 'localhost:9200/hippo/_search?q=luca'
failure
• Automatic failover
Adding a node
Node 1: Shard 0 (primary), Shard 1 (replica)
Node 2: Shard 1 (primary), Shard 0 (replica)
• “Hot” reallocation of shards to the new node
Adding a node
Node 1: Shard 0 (primary), Shard 1 (replica)
Node 2: Shard 1 (primary)
Node 3: Shard 0 (replica)
• “Hot” reallocation of shards to the new node
Adding a node
Node 1: Shard 0 (primary), Shard 1 (replica)
Node 2: Shard 1 (primary)
Node 3: Shard 0 (replica)
• “Hot” reallocation of shards to the new node
Node failure
Node 1: Shard 1 (primary)
Node 2: Shard 0 (replica)
Node 3: Shard 0 (primary), Shard 1 (replica)
Node failure - 1
Node 1: Shard 1 (primary)
Node 2: Shard 0 (primary)
• Replicas can automatically become primaries
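Replica promotion can be sketched as bookkeeping over which nodes hold a copy of each shard. This is a toy model, not elasticsearch's cluster state: `FailoverSketch` and its methods are hypothetical, and the convention that the first entry in each list is the primary is an assumption made to keep the example short.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical toy model of promoting a replica when a node fails.
class FailoverSketch {
    // shard number -> nodes holding a copy; by convention the first is the primary
    private final Map<Integer, List<String>> copies = new TreeMap<>();

    void assign(int shard, String primaryNode, String... replicaNodes) {
        List<String> nodes = new ArrayList<>();
        nodes.add(primaryNode);
        nodes.addAll(Arrays.asList(replicaNodes));
        copies.put(shard, nodes);
    }

    // Drops every copy held by the failed node. If that copy was the primary,
    // the next remaining copy moves to the front and so becomes the primary.
    void nodeFailed(String node) {
        for (List<String> nodes : copies.values()) {
            nodes.remove(node);
        }
    }

    // Returns the node holding the primary, or null if no copy is left.
    String primaryOf(int shard) {
        List<String> nodes = copies.get(shard);
        return (nodes == null || nodes.isEmpty()) ? null : nodes.get(0);
    }

    public static void main(String[] args) {
        FailoverSketch cluster = new FailoverSketch();
        cluster.assign(0, "node3", "node2"); // shard 0: primary on node3, replica on node2
        cluster.assign(1, "node1", "node3"); // shard 1: primary on node1, replica on node3
        cluster.nodeFailed("node3");
        System.out.println("shard 0 primary: " + cluster.primaryOf(0));
        System.out.println("shard 1 primary: " + cluster.primaryOf(1));
    }
}
```

The layout in `main` mirrors the three-node diagram on the previous slide: once the node holding the primary of shard 0 disappears, the surviving replica takes over.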
Node failure - 2
Node 1: Shard 1 (primary), Shard 0 (replica)
Node 2: Shard 0 (primary), Shard 1 (replica)
• Shards are automatically assigned and do “hot” recovery
Dynamic Replicas
Node 1: Shard 0 (primary)
Node 2: Shard 0 (replica)
Node 3: (empty)
Client
curl -XPUT localhost:9200/hippo -d '{ "index" : { "number_of_shards" : 1, "number_of_replicas" : 1 }}'
Dynamic Replicas
Node 1: Shard 0 (primary)
Node 2: Shard 0 (replica)
Node 3: Shard 0 (replica)
Client
curl -XPUT localhost:9200/hippo/_settings -d '{ "index" : { "number_of_replicas" : 2 }}'
Indexing (Push) - ElasticSearch
• Documents added through push requests
• Full JSON Object representation of Documents supported
• Embedded objects
• 1st class Parent / Child and Versioning
• Near Realtime index refreshing available
• Realtime get supported
{
  "name": "Luca Cavanna",
  "location": {
    "city": "Amsterdam",
    "country": "The Netherlands"
  }
}
Indexing (Pull) - ElasticSearch
• Data flows from sources using ‘Rivers’
• Continues to add data as it ‘flows’
• Can be added, removed, configured dynamically
• Out-of-the-box support for CouchDB, Twitter (implemented by the es team)
• Community implementations for DBs, other NoSQL and Solr
Searching - ElasticSearch
• Search request in Request Body
• Powerful and extensible Query DSL
• Separation of Query and Filters
• Named Filters allowing tracking of which Documents matched which Filters
• By default storing the source of each document (_source field)
• Catch all feature enabled by default (_all field)
• Sorting of results
• Highlighting, Faceting, Boosting...and more
Search Example - ElasticSearch
$ curl -XGET 'http://localhost:9200/hippo/users/_search' -d '{
  "query" : {
    "term" : { "first_name" : "luca" }
  }
}'
{
  "_shards": {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits": {
    "total" : 1,
    "hits" : [
      {
        "_index" : "hippo",
        "_type" : "users",
        "_id" : "1",
        "_source" : {
          "first_name" : "Luca",
          "last_name" : "Cavanna"
        }
      }
    ]
  }
}
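The same search request can be assembled from Java with the JDK's own HTTP client (Java 11 or later). This is a sketch, not an official client: `SearchRequestSketch`, `termQuery` and `build` are hypothetical helpers, the JSON is built by plain string concatenation without escaping, and POST is used instead of GET because the `_search` endpoint accepts both and not every HTTP client sends a body with GET. The example only builds and prints the request, so it runs without a server.

```java
import java.net.URI;
import java.net.http.HttpRequest;

// Hypothetical helper that builds (but does not send) a search request.
class SearchRequestSketch {
    // Builds the term query body from the slide. No JSON escaping is done,
    // so this is only suitable for simple, trusted values.
    static String termQuery(String field, String value) {
        return "{ \"query\" : { \"term\" : { \"" + field + "\" : \"" + value + "\" } } }";
    }

    // Targets <baseUrl>/<index>/<type>/_search, as in the curl example.
    static HttpRequest build(String baseUrl, String index, String type, String jsonBody) {
        return HttpRequest.newBuilder()
                .uri(URI.create(baseUrl + "/" + index + "/" + type + "/_search"))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(jsonBody))
                .build();
    }

    public static void main(String[] args) {
        String body = termQuery("first_name", "luca");
        HttpRequest request = build("http://localhost:9200", "hippo", "users", body);
        System.out.println(request.method() + " " + request.uri());
        System.out.println(body);
    }
}
```

To actually execute the request against a running node, it could be passed to `HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString())`.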
Thanks
There would be a lot more to say:
• Query DSL
• Scripting module (pluggable implementation)
• Percolator
• Running it embedded
Check them out yourself if you are interested!
Questions?