hippo meetup: enterprise search with solr and elasticsearch

15th January 2013 – Hippo meetup

[email protected] - @lucacavanna

Luca Cavanna

Software developer & Search consultant at Trifork Amsterdam

http://twitter.com/lucacavanna

http://trifork.nl/

Trifork (aka Jteam/Dutchworks/Orange11)

● Focus areas:

– Big data & Search

– Mobile

– Custom solutions

– Knowledge (GOTO Amsterdam)

● Hippo partner

● Hippo related search projects:

– uva.nl

– working on rijksoverheid.nl

http://uva.nl/

http://rijksoverheid.nl/

Agenda

● Search introduction

– Lucene foundation

– Why do we need Solr or elasticsearch?

● Scaling with Solr

● Elasticsearch distributed nature

● Elasticsearch features

Apache Lucene

● High-performance, full-featured text search engine library written entirely in Java

● It indexes documents as collections of fields

● A field is a string based key-value pair

● What data structure does it use under the hood?

Inverted index

1 The old night keeper keeps the keep in the town

2 In the big old house in the big old gown.

3 The house in the town had the big old keep

4 Where the old night keeper never did sleep.

5 The night keeper keeps the keep in the night

6 And keeps in the dark and sleeps in the light.

term     freq   posting list
and      1      6
big      2      2 3
dark     1      6
did      1      4
gown     1      2
had      1      3
house    2      2 3
in       5      1 2 3 5 6
keep     3      1 3 5
keeper   3      1 4 5
keeps    3      1 5 6
light    1      6
never    1      4
night    3      1 4 5
old      4      1 2 3 4
sleep    1      4
sleeps   1      6
the      6      1 2 3 4 5 6
town     2      1 3
where    1      4

Inverted index

● Indexing

– Text analysis (tokenization, lowercasing and more)

● The inverted index can contain more data

– Term offsets and more

● The inverted index itself doesn't contain the text for displaying the search results
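The table above can be reproduced with a few lines of plain Java. This is a minimal sketch, not Lucene's implementation: the class name `InvertedIndexSketch` and its methods are made up for illustration, and "analysis" is reduced to lowercasing and splitting on non-letters.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

// Toy inverted index: term -> sorted list of document ids (posting list).
// Illustration only; Lucene's real structures are far more compact.
class InvertedIndexSketch {

    private final Map<String, List<Integer>> postings = new TreeMap<>();

    // Simplified text analysis: lowercase, then split on anything that is not a letter.
    static List<String> analyze(String text) {
        List<String> tokens = new ArrayList<>();
        for (String token : text.toLowerCase(Locale.ROOT).split("[^a-z]+")) {
            if (!token.isEmpty()) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    // Documents must be added in increasing id order so posting lists stay sorted.
    void add(int docId, String text) {
        for (String term : analyze(text)) {
            List<Integer> docs = postings.computeIfAbsent(term, t -> new ArrayList<>());
            if (docs.isEmpty() || docs.get(docs.size() - 1) != docId) {
                docs.add(docId); // record each document once per term
            }
        }
    }

    // Looking up a term is a single map access; no document text is scanned.
    List<Integer> search(String term) {
        return postings.getOrDefault(term.toLowerCase(Locale.ROOT),
                Collections.<Integer>emptyList());
    }

    // Builds the index for the six example lines used on the slide.
    static InvertedIndexSketch rhyme() {
        String[] lines = {
            "The old night keeper keeps the keep in the town",
            "In the big old house in the big old gown.",
            "The house in the town had the big old keep",
            "Where the old night keeper never did sleep.",
            "The night keeper keeps the keep in the night",
            "And keeps in the dark and sleeps in the light."
        };
        InvertedIndexSketch index = new InvertedIndexSketch();
        for (int i = 0; i < lines.length; i++) {
            index.add(i + 1, lines[i]);
        }
        return index;
    }

    public static void main(String[] args) {
        InvertedIndexSketch index = rhyme();
        System.out.println("keep  -> " + index.search("keep"));  // [1, 3, 5]
        System.out.println("Sleep -> " + index.search("Sleep")); // [4]
    }
}
```

Note that the index only maps terms to document ids. To display a hit you still need the original text, which is why stored fields exist alongside the inverted index.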

Indexing

● Lucene writes indexes as segments

● Segments are not modifiable: Write-Once

● Each segment is a searchable mini index

● Each segment contains:

– Inverted index

– Stored fields

– ...and more

Indexing: the commit operation

● Documents are searchable only after a commit!

● Commit also gives durability

● The most expensive operation in Lucene!!!

Near-real-time search (since Lucene 2.9, exposed in Solr 4.0)

● With the Lucene near-real-time API you don't need a commit to make new documents searchable

● Less expensive than commit

● Doesn't guarantee durability though

● Exposed as soft commit in Solr 4.0
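The difference between a commit and a near-real-time reopen can be sketched with a toy model. This is plain Java, not Lucene's API: `SegmentedIndexSketch`, `refresh()`, `commit()` and `crash()` are hypothetical names, and as a simplification `commit()` here also makes documents visible.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Toy model of write-once segments with a cheap "refresh" and an expensive "commit".
// All names are illustrative assumptions; this is not how Lucene is implemented.
class SegmentedIndexSketch {

    private final List<String> buffer = new ArrayList<>();        // indexed but not yet searchable
    private final List<List<String>> segments = new ArrayList<>(); // immutable, searchable segments
    private int durableSegments = 0;                                // segments that would survive a crash

    void addDocument(String doc) {
        buffer.add(doc);
    }

    // Near-real-time reopen (soft commit): buffered documents become a new
    // searchable segment, but nothing is made durable.
    void refresh() {
        if (!buffer.isEmpty()) {
            segments.add(Collections.unmodifiableList(new ArrayList<>(buffer)));
            buffer.clear();
        }
    }

    // Commit: make everything searchable and durable (a real commit syncs files to disk).
    void commit() {
        refresh();
        durableSegments = segments.size();
    }

    // Searches every segment; each one is a small index of its own.
    int count(String word) {
        int hits = 0;
        for (List<String> segment : segments) {
            for (String doc : segment) {
                if (doc.contains(word)) {
                    hits++;
                }
            }
        }
        return hits;
    }

    // Simulated crash: everything that was not committed is lost.
    void crash() {
        buffer.clear();
        segments.subList(durableSegments, segments.size()).clear();
    }
}
```

In this model a document added and refreshed is found by `count`, but disappears after `crash()`; only a `commit()` makes it survive.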

Lucene code example – indexing data

IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_40, new StandardAnalyzer(Version.LUCENE_40));
Directory directory = FSDirectory.open(new File("data"));
IndexWriter writer = new IndexWriter(directory, config);

Document document = new Document();

// id: indexed as a single token (not analyzed) and stored
FieldType idFieldType = new FieldType();
idFieldType.setIndexed(true);
idFieldType.setStored(true);
idFieldType.setTokenized(false);
document.add(new Field("id", "id-1", idFieldType));

// title: indexed and stored, so it can be searched and displayed
FieldType titleFieldType = new FieldType();
titleFieldType.setIndexed(true);
titleFieldType.setStored(true);
document.add(new Field("title", "This is the title", titleFieldType));

// description: indexed only, so it can be searched but not displayed
FieldType descriptionFieldType = new FieldType();
descriptionFieldType.setIndexed(true);
document.add(new Field("description", "This is the description", descriptionFieldType));

writer.addDocument(document);
writer.close(); // close() also commits the pending changes

Lucene code example – querying and showing results

// queryAsString holds the user's query text (defined elsewhere)
QueryParser queryParser = new QueryParser(Version.LUCENE_40, "title", new StandardAnalyzer(Version.LUCENE_40));
Query query = queryParser.parse(queryAsString);

Directory directory = FSDirectory.open(new File("data"));
IndexReader indexReader = DirectoryReader.open(directory);
IndexSearcher indexSearcher = new IndexSearcher(indexReader);

// Ask for the 10 best matching documents
TopDocs topDocs = indexSearcher.search(query, 10);
System.out.println("Total hits: " + topDocs.totalHits);

for (ScoreDoc hit : topDocs.scoreDocs) {
    // Load the stored fields of the hit; fields that are not stored are not returned
    Document document = indexSearcher.doc(hit.doc);
    for (IndexableField field : document) {
        System.out.println(field.name() + ": " + field.stringValue());
    }
}

What's missing?

● A common way to represent documents

● An interface to send documents to (HTTP)

● A way to represent queries

● An interface to send queries to (HTTP)

● Configuration

● Caching

● Distributed infrastructure

● And more...

Enterprise search servers

Scaling – why?

‣ The more concurrent searches you run, the slower they get

‣ Indexing and searching on the same machine will substantially harm search performance

‣ Segment merging can be a CPU- and I/O-intensive operation

‣ Disk cache invalidation

‣ Failover

Solr replication example

Solr replication (pull approach)

• Master-slave based solution

• Single machine for indexing data (master)

• Multiple machines for querying (slaves)

• Master is not aware of the slaves

• Slave is aware of the master

• Load balancer responsible for balancing the query requests

• What about real-time search? No way!
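The pull model can be sketched as follows. This is an illustrative toy in plain Java, not Solr code: the `Master` and `Slave` classes and the version counter are assumptions made for the example. It shows why new documents only become visible on a slave after its next poll.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Toy master-slave pull replication. Class and method names are hypothetical.
class PullReplicationSketch {

    // The master indexes documents and knows nothing about its slaves.
    static class Master {
        private final List<String> docs = new ArrayList<>();
        private long indexVersion = 0;

        void indexAndCommit(String doc) {
            docs.add(doc);
            indexVersion++; // every commit produces a new index version
        }

        long version() {
            return indexVersion;
        }

        List<String> snapshot() {
            return Collections.unmodifiableList(new ArrayList<>(docs));
        }
    }

    // The slave knows its master and periodically asks for a newer index.
    static class Slave {
        private final Master master;
        private List<String> docs = Collections.emptyList();
        private long indexVersion = 0;

        Slave(Master master) {
            this.master = master;
        }

        // Called once per poll interval; copies the index only if the master is ahead.
        boolean poll() {
            if (master.version() > indexVersion) {
                docs = master.snapshot();
                indexVersion = master.version();
                return true;
            }
            return false;
        }

        boolean contains(String doc) {
            return docs.contains(doc);
        }
    }
}
```

Between two polls a slave keeps answering queries from its old copy, so the freshness of search results is bounded by the poll interval.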

SolrCloud

• A set of new distributed capabilities in Solr

• Uses Apache ZooKeeper as a system of record for the cluster state, for central configuration, and for leader election

• Whatever server (shard) you send data to:

• the documents get distributed over the shards

• A shard can be a leader or a replica and contains a subset of the data

• Easily scale out by adding new Solr nodes

elasticsearch

● Distributed search engine built on top of Lucene

● Apache 2 license

● Written in Java

● RESTful

● Created and mainly developed by Shay Banon

● A company behind it: elasticsearch.com

● Regular releases

– Latest release 0.20.2

http://elasticsearch.com/

elasticsearch

● Schemaless

– Uses defaults and automatic type guessing

– Custom mappings may be defined if needed

● JSON oriented

● Multi tenancy

– Multiple indexes per node, multiple types per index

● Designed to be distributed from the beginning

● Almost everything is available as API (including configuration)

● Wide range of administration APIs

elasticsearch distributed terminology

● Node: a running instance of elasticsearch which belongs to a cluster (usually one node per server)

● Cluster: one or more nodes with the same cluster name

● Shard: a single Lucene instance. A low-level worker unit managed by elasticsearch. An index is split into one or more shards.

● Index: a logical namespace which points to one or more shards

– Your code won't deal directly with a shard, only with an index

– But an index is composed of multiple Lucene indexes (one per shard)

elasticsearch distributed terminology

● More shards:

– improve indexing performance

– increase data distribution (depends on # of nodes)

– Watch out: each shard has a cost as well!

● More replicas:

– improve failover (fault tolerance)

– improve querying performance

Transaction Log

• Indexed docs are fully persistent

• No need for a Lucene IndexWriter#commit

• Managed using a transaction log / WAL

• Full single node durability (survives a kill -9)

• Utilized when doing hot relocation of shards

• Periodically “flushed” (calling IW#commit)

• Durability and real time search together!
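The idea behind the bullets above can be sketched with a toy write-ahead log. This is a minimal illustration in plain Java, not elasticsearch's implementation: `TranslogSketch` and its methods are hypothetical names, and in-memory lists stand in for files on disk.

```java
import java.util.ArrayList;
import java.util.List;

// Toy write-ahead log (WAL). Names and structure are illustrative assumptions.
class TranslogSketch {

    private final List<String> translog = new ArrayList<>();  // stands in for a durable append-only file
    private final List<String> committed = new ArrayList<>(); // stands in for committed Lucene segments
    private final List<String> inMemory = new ArrayList<>();  // uncommitted index state, lost on a crash

    // Every operation is appended to the log before it is applied,
    // so no Lucene commit is needed per document.
    void index(String doc) {
        translog.add(doc);
        inMemory.add(doc);
    }

    // Periodic flush: do the expensive Lucene commit, then start a fresh log.
    void flush() {
        committed.addAll(inMemory);
        inMemory.clear();
        translog.clear();
    }

    // After a crash the in-memory state is gone; replaying the log restores it.
    void crashAndRecover() {
        inMemory.clear();
        inMemory.addAll(translog);
    }

    List<String> allDocs() {
        List<String> all = new ArrayList<>(committed);
        all.addAll(inMemory);
        return all;
    }
}
```

A document indexed after the last flush is still present after `crashAndRecover()`, because it is replayed from the log.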

Index - Shards & Replicas

[Diagram: two nodes and a client]

curl -XPUT localhost:9200/hippo -d '{ "index" : { "number_of_shards" : 2, "number_of_replicas" : 1 }}'

Index - Shards & Replicas

[Diagram: two nodes and a client; the shards of the index are spread over the nodes, e.g. Shard 0 (primary) and Shard 1 (replica) on the first node]


Indexing - 1

[Diagram: two nodes and a client]

• Automatic sharding, push replication

curl -XPUT localhost:9200/hippo/users/1 -d '{ "name" : { "first" : "Luca", "last" : "Cavanna" }}'
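Automatic sharding boils down to a deterministic mapping from a document id to a shard number. A minimal sketch, assuming a simple hash-modulo scheme: `String.hashCode()` is only a stand-in for the hash function elasticsearch actually uses, and `ShardRoutingSketch` is a hypothetical name.

```java
// Toy document routing: the same id always lands on the same shard.
// String.hashCode() is a stand-in; the real hash function differs.
class ShardRoutingSketch {

    static int shardFor(String docId, int numberOfShards) {
        // floorMod keeps the result in [0, numberOfShards) even for negative hash codes
        return Math.floorMod(docId.hashCode(), numberOfShards);
    }

    public static void main(String[] args) {
        int numberOfShards = 2; // matches the "hippo" index created earlier
        for (String id : new String[] {"1", "2"}) {
            System.out.println("document " + id + " -> shard " + shardFor(id, numberOfShards));
        }
    }
}
```

Because the shard number depends on the number of shards, changing that number would send existing ids to different shards. The number of replicas does not take part in the calculation, so it can be changed freely.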

Indexing - 2

[Diagram: two nodes and a client]

curl -XPUT localhost:9200/hippo/users/2 -d '{ "name" : { "first" : "Jeroen", "last" : "Reijn" }}'

Search - 1

[Diagram: two nodes and a client]

curl -XGET 'localhost:9200/hippo/_search?q=luca'

• Scatter / Gather search
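The gather step of a scatter/gather search can be sketched like this: every shard returns its own best hits, and the node that received the request merges them into one global result list. This is a simplified illustration in plain Java; the `Hit` class and `gather` method are hypothetical, and real distributed search involves more phases than a single merge.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Toy "gather" phase: merge per-shard top hits into a global top-N.
class ScatterGatherSketch {

    static final class Hit {
        final String id;
        final double score;

        Hit(String id, double score) {
            this.id = id;
            this.score = score;
        }
    }

    // perShardHits holds the local results each shard produced in the scatter phase.
    static List<Hit> gather(List<List<Hit>> perShardHits, int size) {
        List<Hit> all = new ArrayList<>();
        for (List<Hit> shardHits : perShardHits) {
            all.addAll(shardHits);
        }
        // Highest score first, then keep only the requested number of hits
        all.sort(Comparator.comparingDouble((Hit h) -> h.score).reversed());
        return new ArrayList<>(all.subList(0, Math.min(size, all.size())));
    }
}
```

Each shard only needs to return as many hits as the client asked for, since no shard can contribute more than that to the final list.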

Search - 2

[Diagram: two nodes and a client]

• Automatic balancing between replicas

Search - 3

[Diagram: two nodes and a client; one node is marked as failed]

• Automatic failover

Adding a node

[Diagram: two nodes]



• “Hot” reallocation of shards to the new node

Adding a node

[Diagram: three nodes]



Node failure

[Diagram: three nodes]



Node failure - 1

[Diagram: the two remaining nodes]


• Replicas can automatically become primaries

Node failure - 2

[Diagram: the two remaining nodes]




• Shards are automatically assigned and do “hot” recovery

Dynamic Replicas

[Diagram: three nodes and a client]

Dynamic Replicas

[Diagram: three nodes and a client]


curl -XPUT localhost:9200/hippo/_settings -d '{ "index" : { "number_of_replicas" : 2 }}'

Indexing (Push) - ElasticSearch

• Documents added through push requests

• Full JSON Object representation of Documents supported

• Embedded objects

• 1st class Parent / Child and Versioning

• Near Realtime index refreshing available

• Realtime get supported

{
  "name": "Luca Cavanna",
  "location": {
    "city": "Amsterdam",
    "country": "The Netherlands"
  }
}

Indexing (Pull) - ElasticSearch

• Data flows from sources using ‘Rivers’

• Continues to add data as it ‘flows’

• Can be added, removed, configured dynamically

• Out-of-the-box support for CouchDB, Twitter (implemented by the es team)

• Community implementations for DBs, other NoSQL and Solr

Searching - ElasticSearch

• Search request in Request Body

• Powerful and extensible Query DSL

• Separation of Query and Filters

• Named Filters allowing tracking of which Documents matched which Filters

• By default storing the source of each document (_source field)

• Catch all feature enabled by default (_all field)

• Sorting of results

• Highlighting, Faceting, Boosting...and more

Search Example - ElasticSearch

$ curl -XGET 'http://localhost:9200/hippo/users/_search' -d '{ "query" : { "term" : { "first_name" : "luca" } }}'

{
  "_shards": { "total" : 5, "successful" : 5, "failed" : 0 },
  "hits": {
    "total" : 1,
    "hits" : [
      {
        "_index" : "hippo",
        "_type" : "users",
        "_id" : "1",
        "_source" : { "first_name" : "Luca", "last_name" : "Cavanna" }
      }
    ]
  }
}

Thanks

There would be a lot more to say:

• Query DSL

• Scripting module (pluggable implementation)

• Percolator

• Running it embedded

Check them out yourself if you are interested!

Questions?
