Recent Additions to Lucene's Arsenal
DESCRIPTION
Presented by Shai Erera, Researcher, IBM. Lucene's arsenal has recently expanded to include two new modules: Index Sorting and Replication. Index sorting lets you keep an index consistently sorted by some criterion (e.g. modification date). This allows efficient early termination of searches and achieves better index compression. Index replication lets you replicate a search index to achieve high availability and fault tolerance, and to take hot index backups. In this talk we will introduce these modules and discuss implementation and design details as well as best practices.
TRANSCRIPT
![Page 1: Recent Additions to Lucene Arsenal](https://reader036.vdocuments.site/reader036/viewer/2022081400/554f6c01b4c9058a148b4f84/html5/thumbnails/1.jpg)
![Page 2: Recent Additions to Lucene Arsenal](https://reader036.vdocuments.site/reader036/viewer/2022081400/554f6c01b4c9058a148b4f84/html5/thumbnails/2.jpg)
Recent Additions to Lucene’s Arsenal
Shai Erera, Researcher, IBM
Adrien Grand, ElasticSearch
![Page 3: Recent Additions to Lucene Arsenal](https://reader036.vdocuments.site/reader036/viewer/2022081400/554f6c01b4c9058a148b4f84/html5/thumbnails/3.jpg)
Who We Are
• Shai Erera
  – Working at IBM, Information Retrieval Research
  – Lucene/Solr committer and PMC member
  – http://shaierera.blogspot.com
• Adrien Grand
  – @jpountz
  – Lucene/Solr committer and PMC member
  – Software engineer at Elasticsearch
![Page 4: Recent Additions to Lucene Arsenal](https://reader036.vdocuments.site/reader036/viewer/2022081400/554f6c01b4c9058a148b4f84/html5/thumbnails/4.jpg)
The Replicator
![Page 5: Recent Additions to Lucene Arsenal](https://reader036.vdocuments.site/reader036/viewer/2022081400/554f6c01b4c9058a148b4f84/html5/thumbnails/5.jpg)
Load Balancing
[Diagram: Load Balancer]
![Page 6: Recent Additions to Lucene Arsenal](https://reader036.vdocuments.site/reader036/viewer/2022081400/554f6c01b4c9058a148b4f84/html5/thumbnails/6.jpg)
Failover
![Page 7: Recent Additions to Lucene Arsenal](https://reader036.vdocuments.site/reader036/viewer/2022081400/554f6c01b4c9058a148b4f84/html5/thumbnails/7.jpg)
Index Backup
![Page 8: Recent Additions to Lucene Arsenal](https://reader036.vdocuments.site/reader036/viewer/2022081400/554f6c01b4c9058a148b4f84/html5/thumbnails/8.jpg)
The Replicator
[Diagram: Primary, Backup, Backup; component labels: Replicator, ReplicationClient, ReplicationClient]
http://shaierera.blogspot.com/2013/05/the-replicator.html
![Page 9: Recent Additions to Lucene Arsenal](https://reader036.vdocuments.site/reader036/viewer/2022081400/554f6c01b4c9058a148b4f84/html5/thumbnails/9.jpg)
Replication Components
• Replicator
  – Mediates between the client and server
  – Manages the published Revisions
  – An implementation is provided for replication over HTTP
• Revision
  – Describes a list of files and metadata
  – Responsible for ensuring the files remain available as long as clients are replicating it
• ReplicationClient
  – Performs the replication operation on the replica side
  – Copies delta files and invokes ReplicationHandler upon successful copy
  – Always replicates the latest revision
• ReplicationHandler
  – Acts on the copied files
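The "copies delta files" and "always replicates the latest revision" behaviour described above can be illustrated with a small toy model. This is only a sketch, not Lucene's replicator API: `ToyRevision`, `ToyReplicator`, `deltaFiles` and the file names are all hypothetical names invented for the illustration.

```java
import java.util.Arrays;
import java.util.Set;
import java.util.TreeSet;

class ReplicationSketch {
    /** Toy stand-in for a Revision: a version number plus the files it references. */
    static final class ToyRevision {
        final int version;
        final Set<String> files;

        ToyRevision(int version, String... files) {
            this.version = version;
            this.files = new TreeSet<>(Arrays.asList(files));
        }
    }

    /** Toy stand-in for a Replicator: remembers only the latest published revision. */
    static final class ToyReplicator {
        private ToyRevision latest;

        void publish(ToyRevision revision) {
            // Older revisions never replace a newer one.
            if (latest == null || revision.version > latest.version) {
                latest = revision;
            }
        }

        ToyRevision latest() {
            return latest;
        }
    }

    /** Files the replica must copy: those in the revision that it does not already have. */
    static Set<String> deltaFiles(ToyRevision revision, Set<String> replicaFiles) {
        Set<String> delta = new TreeSet<>(revision.files);
        delta.removeAll(replicaFiles);
        return delta;
    }

    public static void main(String[] args) {
        ToyReplicator replicator = new ToyReplicator();
        replicator.publish(new ToyRevision(1, "segments_1", "_0.cfs"));
        replicator.publish(new ToyRevision(2, "segments_2", "_0.cfs", "_1.cfs"));

        // The replica already holds revision 1, so only the new files are copied.
        Set<String> replicaFiles = new TreeSet<>(Arrays.asList("segments_1", "_0.cfs"));
        System.out.println(deltaFiles(replicator.latest(), replicaFiles));
    }
}
```

In this model the replica that already holds revision 1 only needs the two files introduced by revision 2; the shared file is not copied again.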
![Page 10: Recent Additions to Lucene Arsenal](https://reader036.vdocuments.site/reader036/viewer/2022081400/554f6c01b4c9058a148b4f84/html5/thumbnails/10.jpg)
Index Replication
• IndexRevision
  – Obtains a snapshot of the last commit through SnapshotDeletionPolicy
  – The snapshot is released when the revision is released by the Replicator
• IndexReplicationHandler
  – Copies the files to the index directory and fsyncs them
  – Aborts (rolls back) on any error
  – Upon successful completion, invokes a callback (e.g. SearcherManager.maybeRefresh())
• Similar extensions for faceted index replication
  – IndexAndTaxonomyRevision: obtains snapshots of both the search and taxonomy indexes
  – IndexAndTaxonomyReplicationHandler: copies the files to the respective directories, keeping both in sync
![Page 11: Recent Additions to Lucene Arsenal](https://reader036.vdocuments.site/reader036/viewer/2022081400/554f6c01b4c9058a148b4f84/html5/thumbnails/11.jpg)
Sample Code

    import java.util.concurrent.Callable;
    import org.apache.lucene.replicator.*;
    import org.apache.lucene.replicator.ReplicationClient.ReplicationHandler;
    import org.apache.lucene.replicator.ReplicationClient.SourceDirectoryFactory;

    // Server-side: publish a new Revision
    Replicator replicator = new LocalReplicator();
    replicator.publish(new IndexRevision(indexWriter));

    // Client-side: replicate a Revision
    Replicator replicator; // either LocalReplicator or HttpReplicator

    // refresh SearcherManager after index is updated
    Callable<Boolean> callback = new Callable<Boolean>() {
      public Boolean call() throws Exception {
        // index was updated, refresh manager
        searcherManager.maybeRefresh();
        return true;
      }
    };

    ReplicationHandler handler = new IndexReplicationHandler(indexDir, callback);
    SourceDirectoryFactory factory = new PerSessionDirectoryFactory(workDir);
    ReplicationClient client = new ReplicationClient(replicator, handler, factory);

    client.updateNow(); // invoke client manually
    // -- OR --
    client.startUpdateThread(30000); // check for updates every 30 seconds
![Page 12: Recent Additions to Lucene Arsenal](https://reader036.vdocuments.site/reader036/viewer/2022081400/554f6c01b4c9058a148b4f84/html5/thumbnails/12.jpg)
Future Work
• Resume
  – Session level: don’t copy files that were already successfully copied
  – File level: don’t copy file parts that were already successfully copied
• Parallel Replication
  – Copy revision files in parallel
• Other replication strategies
  – Peer-to-peer
![Page 13: Recent Additions to Lucene Arsenal](https://reader036.vdocuments.site/reader036/viewer/2022081400/554f6c01b4c9058a148b4f84/html5/thumbnails/13.jpg)
Index Sorting
How to trade index speed for search speed
![Page 14: Recent Additions to Lucene Arsenal](https://reader036.vdocuments.site/reader036/viewer/2022081400/554f6c01b4c9058a148b4f84/html5/thumbnails/14.jpg)
Anatomy of a Lucene index
Index = collection of immutable segments
Segments store documents sequentially on disk
Add data = create a new segment
Segments eventually get merged together
Order of segments / documents in segments doesn’t matter
– the following segments are equivalent

Price: 9 0 7 8 2 2 1 8 10 3 4 4 6 10 1 1 13
Id: 1 3 10 4 7 20 42 11 9 8 15 18 30 31 99 5 12

Price: 13 10 10 9 8 8 7 6 4 4 3 2 2 1 1 1 0
Id: 12 9 31 1 4 11 10 30 15 18 8 7 20 42 99 5 3
![Page 15: Recent Additions to Lucene Arsenal](https://reader036.vdocuments.site/reader036/viewer/2022081400/554f6c01b4c9058a148b4f84/html5/thumbnails/15.jpg)
Anatomy of a Lucene index
ordinal of a doc in a segment = doc id
used in the inverted index to refer to docs

Price: 9 0 7 8 2 2 1 8 10 3 4 4 6 10 1 1 13
Id: 1 3 10 4 7 20 42 11 9 8 15 18 30 31 99 5 12
doc id: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16

shoe: 1, 3, 5, 8, 11, 13, 15
![Page 16: Recent Additions to Lucene Arsenal](https://reader036.vdocuments.site/reader036/viewer/2022081400/554f6c01b4c9058a148b4f84/html5/thumbnails/16.jpg)
Top hits
Get top N=2 results:
– Create a priority queue of size N
– Accumulate matching docs

Price: 9 0 7 8 2 2 1 8 10 3 4 4 6 10 1 1 13
Id: 1 3 10 4 7 20 42 11 9 8 15 18 30 31 99 5 12
Priority queue: () (3) (3,4) (4,20) (4,9) (4,9) (9,31) (9,31)

Create an empty priority queue
Automatic overflow of the priority queue to remove the least one
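The top-N collection walked through on this slide can be sketched in plain Java. This is a minimal illustration using the slide's Price and Id rows and the postings of "shoe", not Lucene's collector code; the class and method names are made up, and ties on price are broken in favour of the lower doc id (an assumption chosen so the result matches the queue states shown).

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

class TopHitsDemo {
    // Data from the slide, indexed by doc id 0..16.
    static final int[] PRICE = {9, 0, 7, 8, 2, 2, 1, 8, 10, 3, 4, 4, 6, 10, 1, 1, 13};
    static final int[] ID    = {1, 3, 10, 4, 7, 20, 42, 11, 9, 8, 15, 18, 30, 31, 99, 5, 12};
    // Postings of the term "shoe": matching doc ids in increasing order.
    static final int[] SHOE = {1, 3, 5, 8, 11, 13, 15};

    // Weakest hit first: lower price is weaker; on equal price the higher doc id is weaker.
    static final Comparator<Integer> WEAKEST_FIRST = (a, b) -> {
        int byPrice = Integer.compare(PRICE[a], PRICE[b]);
        return byPrice != 0 ? byPrice : Integer.compare(b, a);
    };

    /** Returns the Ids of the n highest-priced matching docs, best first. */
    static List<Integer> topN(int[] matches, int n) {
        PriorityQueue<Integer> pq = new PriorityQueue<>(WEAKEST_FIRST);
        for (int doc : matches) {
            if (pq.size() < n) {
                pq.add(doc);                       // queue not full yet: always keep the hit
            } else if (WEAKEST_FIRST.compare(doc, pq.peek()) > 0) {
                pq.poll();                         // overflow: drop the weakest hit
                pq.add(doc);
            }
        }
        List<Integer> ids = new ArrayList<>();
        while (!pq.isEmpty()) {
            ids.add(ID[pq.poll()]);                // polled weakest first
        }
        Collections.reverse(ids);                  // report best first
        return ids;
    }

    public static void main(String[] args) {
        System.out.println(topN(SHOE, 2));
    }
}
```

Note that every one of the seven matching docs has to be visited here, because on an unsorted index a better hit can appear at any position.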
![Page 17: Recent Additions to Lucene Arsenal](https://reader036.vdocuments.site/reader036/viewer/2022081400/554f6c01b4c9058a148b4f84/html5/thumbnails/17.jpg)
Early termination
Let’s do the same on a sorted index

Price: 13 10 10 9 8 8 7 6 4 4 3 2 2 1 1 1 0
Id: 12 9 31 1 4 11 10 30 15 18 8 7 20 42 99 5 3
Priority queue: () (9) (9,31) (9,31) (9,31) (9,31) (9,31) (9,31)

Priority queue never changes after this document
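On an index sorted by descending price, the first N matches are already the top N, so collection can stop there. The sketch below illustrates this with the slide's sorted Id row. The postings of "shoe" in the sorted index are not shown on the slide; they are derived here by locating the same matching Ids in the sorted row, and the class and field names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

class EarlyTerminationDemo {
    // Id per doc id in the price-sorted index (from the slide).
    static final int[] ID = {12, 9, 31, 1, 4, 11, 10, 30, 15, 18, 8, 7, 20, 42, 99, 5, 3};
    // Doc ids matching "shoe" in the sorted index (derived, see the note above).
    static final int[] SHOE = {1, 2, 4, 9, 12, 15, 16};

    // Number of matching docs actually visited by the last call to topN.
    static int visited;

    /** Returns the Ids of the first n matches, which are the top n on a sorted index. */
    static List<Integer> topN(int[] matches, int n) {
        List<Integer> ids = new ArrayList<>();
        visited = 0;
        for (int doc : matches) {
            if (ids.size() >= n) {
                break;                 // early termination: later docs cannot rank higher
            }
            visited++;
            ids.add(ID[doc]);
        }
        return ids;
    }

    public static void main(String[] args) {
        List<Integer> top = topN(SHOE, 2);
        System.out.println(top + ", visited " + visited + " of " + SHOE.length + " matches");
    }
}
```

No priority queue is needed in this fully sorted case, and only two of the seven matches are visited.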
![Page 18: Recent Additions to Lucene Arsenal](https://reader036.vdocuments.site/reader036/viewer/2022081400/554f6c01b4c9058a148b4f84/html5/thumbnails/18.jpg)
Early termination
Pros
– makes finding the top hits much faster
– file-system cache-friendly
Cons
– only works for static ranks
  – not if the sort order depends on the query
– requires the index to be sorted
– doesn’t work for tasks that require visiting every doc:
  – total number of matches
  – faceting
![Page 19: Recent Additions to Lucene Arsenal](https://reader036.vdocuments.site/reader036/viewer/2022081400/554f6c01b4c9058a148b4f84/html5/thumbnails/19.jpg)
Static ranks
Not uncommon!
Graph-based ranks
– Google’s PageRank
Facebook social search / Unicorn
– https://www.facebook.com/publications/219621248185635
Many more...
Doesn’t need to be the exact sort order
– heuristics when score is only a function of the static rank
![Page 20: Recent Additions to Lucene Arsenal](https://reader036.vdocuments.site/reader036/viewer/2022081400/554f6c01b4c9058a148b4f84/html5/thumbnails/20.jpg)
Offline sorting
A live index can’t be kept sorted
– would require inserting docs between existing docs!
– segments are immutable
Offline sorting to the rescue:
– index as usual
– sort into a new index
– search!
Pros/cons
– super fast to search, the whole index is fully sorted
– but only works for static content
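Conceptually, "sort into a new index" amounts to computing a mapping from new doc ids to old doc ids and rewriting every per-document value in that order. The sketch below shows the idea on the Price and Id rows from the earlier slides; it is a plain-Java illustration with hypothetical names, not how Lucene's sorter is implemented. It assumes a stable sort, so documents with equal prices keep their original relative order.

```java
import java.util.Arrays;
import java.util.Comparator;

class OfflineSortDemo {
    // Unsorted segment from the earlier slides, indexed by doc id.
    static final int[] PRICE = {9, 0, 7, 8, 2, 2, 1, 8, 10, 3, 4, 4, 6, 10, 1, 1, 13};
    static final int[] ID    = {1, 3, 10, 4, 7, 20, 42, 11, 9, 8, 15, 18, 30, 31, 99, 5, 12};

    /** Returns a map where map[newDoc] is the old doc id placed at position newDoc. */
    static int[] newToOld(int[] price) {
        Integer[] docs = new Integer[price.length];
        for (int i = 0; i < docs.length; i++) {
            docs[i] = i;
        }
        // Arrays.sort on objects is stable: equal prices keep their original order.
        Arrays.sort(docs, Comparator.comparingInt((Integer d) -> price[d]).reversed());
        int[] map = new int[docs.length];
        for (int i = 0; i < map.length; i++) {
            map[i] = docs[i];
        }
        return map;
    }

    /** Rewrites a per-document column in the new doc id order. */
    static int[] reorder(int[] values, int[] newToOld) {
        int[] out = new int[values.length];
        for (int newDoc = 0; newDoc < out.length; newDoc++) {
            out[newDoc] = values[newToOld[newDoc]];
        }
        return out;
    }

    public static void main(String[] args) {
        int[] map = newToOld(PRICE);
        System.out.println("Price: " + Arrays.toString(reorder(PRICE, map)));
        System.out.println("Id:    " + Arrays.toString(reorder(ID, map)));
    }
}
```

Applied to the unsorted segment, this reproduces the sorted Price and Id rows shown on the "Anatomy of a Lucene index" slide.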
![Page 21: Recent Additions to Lucene Arsenal](https://reader036.vdocuments.site/reader036/viewer/2022081400/554f6c01b4c9058a148b4f84/html5/thumbnails/21.jpg)
Offline Sorting

    import org.apache.lucene.index.*;
    import org.apache.lucene.index.sorter.*;

    // open a reader on the unsorted index and create a sorted (but slow) view
    DirectoryReader reader = DirectoryReader.open(in);
    boolean ascending = false;
    Sorter sorter = new NumericDocValuesSorter("price", ascending);
    AtomicReader sortedReader = SortingAtomicReader.wrap(
        SlowCompositeReaderWrapper.wrap(reader), sorter);

    // copy the content of the sorted reader to the new dir
    IndexWriter writer = new IndexWriter(out, iwConf);
    writer.addIndexes(sortedReader);
    writer.close();
    reader.close();
![Page 22: Recent Additions to Lucene Arsenal](https://reader036.vdocuments.site/reader036/viewer/2022081400/554f6c01b4c9058a148b4f84/html5/thumbnails/22.jpg)
Online sorting?
Sort segments independently
– wouldn’t require inserting data into existing segments
– collection could still be early-terminated on a per-segment basis
But segments are immutable
– must be sorted before they are written
![Page 23: Recent Additions to Lucene Arsenal](https://reader036.vdocuments.site/reader036/viewer/2022081400/554f6c01b4c9058a148b4f84/html5/thumbnails/23.jpg)
Online sorting?
2 sources of segments
– flush
– merge
flushed segments can’t be sorted
– Lucene writes stored fields to disk on the fly
– could be buffered but this would require a lot of memory
merged segments can be sorted
– create a sorted view over the segments to merge
– pass this view to SegmentMerger instead of the original segments
not a bad trade-off
– flushed segments are usually small & fast to collect
![Page 24: Recent Additions to Lucene Arsenal](https://reader036.vdocuments.site/reader036/viewer/2022081400/554f6c01b4c9058a148b4f84/html5/thumbnails/24.jpg)
Online sorting?
Flushed segments
– NRT reopens
– RAM buffer size limit hit
Merged segments
http://blog.mikemccandless.com/2011/02/visualizing-lucenes-segment-merges.html
Merged segments can easily take 99+% of the size of the index
![Page 25: Recent Additions to Lucene Arsenal](https://reader036.vdocuments.site/reader036/viewer/2022081400/554f6c01b4c9058a148b4f84/html5/thumbnails/25.jpg)
Online Sorting

    import org.apache.lucene.index.*;
    import org.apache.lucene.index.sorter.*;

    IndexWriterConfig iwConf = new IndexWriterConfig(...);

    // original MergePolicy finds the segments to merge
    MergePolicy origMP = iwConf.getMergePolicy();

    // SortingMergePolicy wraps the segments with a sorted view
    boolean ascending = false;
    Sorter sorter = new NumericDocValuesSorter("price", ascending);
    MergePolicy sortingMP = new SortingMergePolicy(origMP, sorter);

    // setup IndexWriter to use SortingMergePolicy
    iwConf.setMergePolicy(sortingMP);
    IndexWriter writer = new IndexWriter(dir, iwConf);

    // index as usual
![Page 26: Recent Additions to Lucene Arsenal](https://reader036.vdocuments.site/reader036/viewer/2022081400/554f6c01b4c9058a148b4f84/html5/thumbnails/26.jpg)
Early termination
Collect top N matches
Offline sorting
– index sorted globally
– early terminate after N matches have been collected
– no priority queue needed!
Online sorting
– no early termination on flushed segments
– early termination on merged segments
  – if N matches have been collected
  – or if current match is less than the top of the PQ
![Page 27: Recent Additions to Lucene Arsenal](https://reader036.vdocuments.site/reader036/viewer/2022081400/554f6c01b4c9058a148b4f84/html5/thumbnails/27.jpg)
Early Termination

    class MyCollector extends Collector {
      // Fields (sorter, maxDocsToCollect, pq, curVal, readerIsSorted, collected)
      // and the remaining Collector methods are omitted on the slide.

      @Override
      public void setNextReader(AtomicReaderContext context) throws IOException {
        readerIsSorted = SortingMergePolicy.isSorted(context.reader(), sorter);
        collected = 0;
      }

      @Override
      public void collect(int doc) throws IOException {
        if (readerIsSorted && (collected >= maxDocsToCollect || curVal <= pq.top())) {
          // Special exception that tells IndexSearcher to terminate
          // collection of the current segment
          throw new CollectionTerminatedException();
        }
        // collect hit
        collected++;
      }
    }
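The per-segment logic can also be tried without Lucene. The following toy simulation is only a sketch under stated assumptions: segments are modelled as arrays of the prices of their matching docs, the `Segment` class and all names are hypothetical, and the sample prices are invented for the illustration. Sorted (merged) segments terminate early, while unsorted (flushed) segments are collected in full.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.PriorityQueue;

class PerSegmentTerminationDemo {
    /** Toy segment: prices of its matching docs in doc id order. */
    static final class Segment {
        final int[] prices;
        final boolean sortedDescending;

        Segment(int[] prices, boolean sortedDescending) {
            this.prices = prices;
            this.sortedDescending = sortedDescending;
        }
    }

    final int n;
    final PriorityQueue<Integer> pq = new PriorityQueue<>(); // min-heap: head is the weakest hit
    int examined = 0;                                         // docs looked at across all segments

    PerSegmentTerminationDemo(int n) {
        this.n = n;
    }

    void collectSegment(Segment segment) {
        int collected = 0;
        for (int price : segment.prices) {
            examined++;
            boolean full = pq.size() == n;
            if (segment.sortedDescending && (collected >= n || (full && price <= pq.peek()))) {
                return; // nothing later in this sorted segment can enter the queue
            }
            collected++;
            if (!full) {
                pq.add(price);
            } else if (price > pq.peek()) {
                pq.poll();
                pq.add(price);
            }
        }
    }

    /** Current top hits, best first. */
    List<Integer> topHits() {
        List<Integer> hits = new ArrayList<>(pq);
        hits.sort(Collections.reverseOrder());
        return hits;
    }

    public static void main(String[] args) {
        PerSegmentTerminationDemo collector = new PerSegmentTerminationDemo(2);
        collector.collectSegment(new Segment(new int[] {13, 10, 10, 9, 8, 8, 7}, true)); // merged, sorted
        collector.collectSegment(new Segment(new int[] {2, 12, 1}, false));              // flushed, unsorted
        collector.collectSegment(new Segment(new int[] {11, 5, 4}, true));               // merged, sorted
        System.out.println(collector.topHits() + " after examining "
            + collector.examined + " of 13 docs");
    }
}
```

The first sorted segment stops once N hits have been collected, the flushed segment is scanned completely, and the last sorted segment stops at its first doc because that doc is no better than the weakest hit already in the queue.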
![Page 28: Recent Additions to Lucene Arsenal](https://reader036.vdocuments.site/reader036/viewer/2022081400/554f6c01b4c9058a148b4f84/html5/thumbnails/28.jpg)
Questions?