Query Latency Optimization with Lucene
DESCRIPTION
Presented by Stefan Pohl, Senior Research Engineer, HERE, a Nokia Business
Besides the quality of results, the time that it takes from the submission of a query to the display of results is of utmost importance to user satisfaction. Within search engine implementations such as Apache Lucene, significant development efforts are hence directed towards reducing query latency. In this session, I will explain reasons for high query latencies and describe general approaches and recent developments within Lucene to counter them. To make the presented material relevant to a wider audience, I will focus on the actual query processing, as this is at the core of every query and search use-case.
TRANSCRIPT
Query Latency Optimization
Stefan Pohl
Sr. Research Engineer, Ph.D.
[email protected]
7 Nov 2013 Query Latency Optimization with Lucene 2
Who Am I
● Search user, developer, researcher
● Many years in industry & academia
● Ph.D. in Information Retrieval
● Interests: Search, Big Data, Machine Learning
● Currently working on the Geocoding offer of HERE,
Nokia's Location Platform
● Spare time: Lucene contributor
Agenda
● Motivation
● Latency Optimization
● Query Processing / Scoring
● Recent Developments in Lucene
Motivation: Query Latency
● Human reaction time: 200 ms*
→ Backend latency: << 200 ms
● Faster queries mean a higher manageable load
● Costs
* Steven C. Seow, Designing and Engineering Time: The Psychology of Time Perception in Software, Addison-Wesley Professional, 2008.
Motivation: Query Latency Distribution
Latency Optimization
First: Do Your Homework
● Keep enough RAM for OS (disk buffer cache)
● Reduce HDD “pressure” (e.g. throttle indexing)
● SSDs
● Warming
● Ideally: your index fits in memory
See http://wiki.apache.org/lucene-java/ImproveSearchingSpeed
Mining Hypothesis
● Check if query latencies are reproducible
● If not, try to find correlations with system events:
  – Many new incoming docs to index?
  – Other daemons spike in disk or CPU activity?
  – Garbage Collections?
  – Other sar statistics (e.g. paging)
● If yes, profile
  – First, your code
  – Don't instrument Lucene internal low-level classes
Hypothesis Testing
● You really think you understand the problem and have a potential solution?
● Try it out (if it's cheap)!
● Otherwise, think of (cheap) experiments that
  – Give confidence
  – Tell you (and others) what the gains are (ROI)
Example: In-memory
● Buy more memory / bigger machine!?
● Simulate¹
  – Consecutively execute the same query multiple times
  – Much lower memory requirement (i.e. the size of the involved postings)
  – Repeat for sample of queries of interest
● Gives lower bound on query latency
¹ S. Pohl, A. Moffat. Measurement Techniques and Caching Effects. In Proceedings of the 31st European Conference on Information Retrieval, Toulouse, France, April 2009. Springer.
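The simulation described on this slide can be sketched as a small timing helper. This is a minimal sketch, not code from the talk: the class `WarmLatencyProbe` and its method `bestOfNanos` are hypothetical names, and the helper assumes that one untimed run is enough to bring the involved postings into the OS buffer cache.

```java
import java.util.concurrent.Callable;

/** Hypothetical helper: estimates the in-memory latency of a single query. */
final class WarmLatencyProbe {

    /**
     * Runs the query once untimed (so its postings end up in the cache),
     * then returns the fastest of `repeats` timed runs, in nanoseconds.
     */
    static long bestOfNanos(Callable<?> query, int repeats) {
        if (repeats < 1) {
            throw new IllegalArgumentException("repeats must be >= 1");
        }
        try {
            query.call();                         // cold run, not measured
            long best = Long.MAX_VALUE;
            for (int i = 0; i < repeats; i++) {
                long start = System.nanoTime();
                query.call();                     // warm run: data should be cached now
                best = Math.min(best, System.nanoTime() - start);
            }
            return best;
        } catch (Exception e) {
            throw new IllegalStateException("query execution failed", e);
        }
    }
}
```

With Lucene on the classpath, a call could look like `WarmLatencyProbe.bestOfNanos(() -> searcher.search(query, 10), 5)`, where `searcher` and `query` are your own `IndexSearcher` and `Query`. Repeating this over a sample of queries of interest gives the lower bound mentioned above.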
Query Processing
Conjunctions (i.e. AND / Occur.MUST)
● Sort Boolean clauses by increasing DocFreq f_t
● next() on sparsest posting list (“lead”)
● advance(18) on next sparsest posting list → fail
● Start all over again with “lead”, but advance(22)
● Try to advance(31) on all other posting lists
● Match found → R = {31
● next() on “lead” → R = {31}
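The walk-through above can be sketched in plain Java, without Lucene. This is a simplified illustration, not Lucene's scorer: `ConjunctionSketch` and `Postings` are hypothetical names, posting lists are plain sorted `int` arrays, and `advance()` is a linear scan where a real index would use skip data.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

final class ConjunctionSketch {
    static final int NO_MORE_DOCS = Integer.MAX_VALUE;

    /** Cursor over one sorted posting list (hypothetical stand-in for a doc iterator). */
    static final class Postings {
        final int[] docs;
        int pos = -1;
        Postings(int... docs) { this.docs = docs; }
        int docFreq() { return docs.length; }
        int next() { pos++; return current(); }
        /** Moves to the first doc >= target (linear scan in this sketch). */
        int advance(int target) {
            if (pos < 0) pos = 0;
            while (pos < docs.length && docs[pos] < target) pos++;
            return current();
        }
        private int current() { return pos < docs.length ? docs[pos] : NO_MORE_DOCS; }
    }

    static List<Integer> intersect(Postings... clauses) {
        List<Integer> result = new ArrayList<>();
        if (clauses.length == 0) return result;
        Postings[] sorted = clauses.clone();
        Arrays.sort(sorted, Comparator.comparingInt(Postings::docFreq)); // sparsest first
        Postings lead = sorted[0];
        int doc = lead.next();
        while (doc != NO_MORE_DOCS) {
            int candidate = doc;
            for (int i = 1; i < sorted.length; i++) {
                int other = sorted[i].advance(candidate);
                if (other != candidate) {   // "fail": this clause skipped past the candidate
                    candidate = other;
                    break;
                }
            }
            if (candidate == doc) {         // every clause is positioned on doc
                result.add(doc);
                doc = lead.next();
            } else {
                doc = lead.advance(candidate);  // start over with the lead
            }
        }
        return result;
    }
}
```

The slide's figures are not available, so the following lists are made up to reproduce the same sequence of calls: with `{18, 31}`, `{5, 11, 22, 29, 31}`, `{4, 9, 16, 18, 27, 31, 37}` and `{2, 7, 12, 20, 26, 31, 32}`, the lead yields 18, the next sparsest list answers advance(18) with 22, the lead answers advance(22) with 31, and all other lists match 31, so `intersect` returns `[31]`.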
Disjunctions (i.e. OR / Occur.SHOULD)
● next() on all clauses
● Track clauses in min-heap → R = {2
● next() on all previously matched clauses → R = {2,4
● next() on all previously matched clauses → R = {2,4,5
● Repeat next(), adding one matched doc per step: 7, 9, 11, 12, 16, 18, 20, 22, 26, 27, 29, 31, 32, 37
● Final result → R = {2,4,5,7,9,11,12,16,18,20,22,26,27,29,31,32,37}
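The heap-based merge shown in these steps can be sketched as follows. It is a simplified illustration rather than Lucene's disjunction scorer: `DisjunctionSketch` and `Cursor` are hypothetical names, and posting lists are assumed to be sorted `int` arrays.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.PriorityQueue;

final class DisjunctionSketch {

    /** Heap entry: one posting list plus the position of its current doc. */
    static final class Cursor implements Comparable<Cursor> {
        final int[] docs;
        int pos;
        Cursor(int[] docs) { this.docs = docs; }
        int doc() { return docs[pos]; }
        boolean next() { return ++pos < docs.length; }
        @Override
        public int compareTo(Cursor other) { return Integer.compare(doc(), other.doc()); }
    }

    static List<Integer> union(int[]... postingLists) {
        PriorityQueue<Cursor> heap = new PriorityQueue<>();   // min-heap on current doc
        for (int[] list : postingLists) {
            if (list.length > 0) heap.add(new Cursor(list));  // "next() on all clauses"
        }
        List<Integer> result = new ArrayList<>();
        while (!heap.isEmpty()) {
            int doc = heap.peek().doc();                      // smallest unprocessed doc
            result.add(doc);
            // next() on every clause that matched this doc; each step costs O(log |C|)
            while (!heap.isEmpty() && heap.peek().doc() == doc) {
                Cursor top = heap.poll();
                if (top.next()) heap.add(top);
            }
        }
        return result;
    }
}
```

Every posting passes through the heap once, which is where the O(n log |C|) cost of disjunctive processing comes from. With the made-up lists `{18, 31}`, `{5, 11, 22, 29, 31}`, `{4, 9, 16, 18, 27, 31, 37}` and `{2, 7, 12, 20, 26, 31, 32}`, `union` returns the result set shown above.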
Why Can Query Processing Be Slow?
● Disjunctive processing: O(n log |C|), for n postings and |C| clauses
  – High-DF terms (large n)
  – Many terms (large |C|), e.g. query expansion
  – No / too little use of advance()
● Filter (over-use)
Filter
● Aims:
  – (Pre-)computation of common sub-queries
  – Cache result
  – Don't influence scoring
● Limitation
  – Additional cost for 1st query
  – Currently, no skip information generated
→ Adding filter as a conjunct to queries can sometimes be faster,
e.g. http://java.dzone.com/news/fast-lucene-search-filters
Stopword Removal
● Removal of high-DocFreq terms from
  – Index: 10-30% space saving
  – Query: no very expensive terms
● Limitation:
  – “To be or not to be”
● In general, don't do it
Minor, But Easy Improvements
● Reduce information, increase locality:
  – Don't store TF, if it's almost always 1 (and you don't need positions):
    fieldType.setIndexOptions(IndexOptions.DOCS_ONLY);
  – Use BlockPostingsFormat (default in Lucene ≥ 4.1)
● Tune Space/Time/Quality tradeoffs:
  – DirectDocValues
  – Less complex scoring function
Recent Developments within Lucene
MinShouldMatch (Lucene-4571)
● Don't want matches on only one (stop-)word?
● Enforce at least mm>1 terms to be present!
● Synthetic example query used during dev:
  Terms:    ref   restored  struck  wings  dublin
  DocFreq:  3.8M  32k       32k     32k    32k
● Figure labels: Disjunctive Processing: next() / Conjunctive Processing: advance()
● E.g. mm=2:
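A minimal sketch of why a minimum-should-match constraint allows skipping, assuming the following reading of the approach: a doc that matches at least mm clauses cannot occur only in the mm-1 densest posting lists, so those lists never need to be iterated with next() and are only advance()d to candidates found in the remaining lists. The class `MinShouldMatchSketch` and its `Postings` cursor are hypothetical, and the candidate selection uses a linear scan instead of a heap to keep the code short.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

final class MinShouldMatchSketch {
    static final int NO_MORE_DOCS = Integer.MAX_VALUE;

    /** Cursor over one sorted posting list (hypothetical stand-in for a doc iterator). */
    static final class Postings {
        final int[] docs;
        int pos = 0;
        Postings(int... docs) { this.docs = docs; }
        int doc() { return pos < docs.length ? docs[pos] : NO_MORE_DOCS; }
        int advance(int target) {
            while (pos < docs.length && docs[pos] < target) pos++;
            return doc();
        }
    }

    static List<Integer> match(int mm, Postings... clauses) {
        if (mm < 1 || mm > clauses.length) {
            throw new IllegalArgumentException("mm must be in 1..number of clauses");
        }
        Postings[] sorted = clauses.clone();
        Arrays.sort(sorted, Comparator.comparingInt((Postings p) -> p.docs.length));
        int headSize = sorted.length - (mm - 1);   // the mm-1 densest lists form the tail
        List<Integer> result = new ArrayList<>();
        while (true) {
            // Disjunctive part: smallest current doc among the sparse "head" lists
            int candidate = NO_MORE_DOCS;
            for (int i = 0; i < headSize; i++) {
                candidate = Math.min(candidate, sorted[i].doc());
            }
            if (candidate == NO_MORE_DOCS) break;
            int matches = 0;
            for (int i = 0; i < headSize; i++) {
                if (sorted[i].doc() == candidate) {
                    matches++;
                    sorted[i].advance(candidate + 1);  // move past the candidate
                }
            }
            // Conjunctive part: dense lists are only advanced, and only while needed
            for (int i = headSize; i < sorted.length && matches < mm; i++) {
                if (sorted[i].advance(candidate) == candidate) matches++;
            }
            if (matches >= mm) result.add(candidate);
        }
        return result;
    }
}
```

For example, with a dense list `1..10` and sparse lists `{2, 7}`, `{7, 9}` and `{3}`, `match(2, ...)` returns `[2, 3, 7, 9]` and `match(3, ...)` returns `[7]`, while the dense list is only ever advanced to candidate docs.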
MinShouldMatch (Lucene-4571)
DocFreq:     3.8M  32k       32k     32k    32k
HighDF 1/5:  ref   restored  struck  wings  dublin
HighDF 2/5:  ref   http      struck  wings  dublin
HighDF 3/5:  ref   http      from    wings  dublin
HighDF 4/5:  ref   http      from    name   dublin
HighDF 5/5:  ref   http      from    name   title
DocFreq:     3.8M  3.5M      3.2M    2.8M   2.4M
MinShouldMatch – Results (Lucene-4571)
MinShouldMatch – Open Questions (Lucene-4571)
● How bad is it to exclude docs that only match one, but an important term?
● Why is it enough to match any mm terms?
● Why not provide a list of stop-words to a 'StopwordExcludingScorer'? (But be careful: “To Be Or Not To Be”)
ReqOptSumScorer
● Benefit:
  – Conjunctive processing on required clauses
  – Calls advance() on optional clauses
● How do you determine which clauses are required?
  – Lookup term statistics (i.e. DocFreq)
  – 2nd lookup unnecessary, if you hand over stats to query
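The benefit listed above can be illustrated with a small sketch: the required clause drives the iteration, and the optional clause is only advanced to the docs the required clause produces. This is a simplified illustration with constant per-clause scores, not the actual ReqOptSumScorer; the class `ReqOptSketch` and its parameters are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class ReqOptSketch {

    /**
     * Scores every doc of the required posting list. A hit in the optional
     * list adds to the score but never decides whether a doc matches.
     */
    static Map<Integer, Float> score(int[] required, float reqScore,
                                     int[] optional, float optScore) {
        Map<Integer, Float> scores = new LinkedHashMap<>();
        int opt = 0;                                   // cursor into the optional list
        for (int doc : required) {                     // required clause drives iteration
            while (opt < optional.length && optional[opt] < doc) {
                opt++;                                 // advance(doc) on the optional clause
            }
            boolean optionalHit = opt < optional.length && optional[opt] == doc;
            scores.put(doc, optionalHit ? reqScore + optScore : reqScore);
        }
        return scores;
    }
}
```

Optional postings that lie between two required docs are skipped without being scored, which is where the saving over a plain disjunction comes from.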
CommonTermsQuery (≥ 4.1) (Lucene-4628)
● Looks up term infos (docfreq, posting list offset)
● Categorizes query terms as
  – Low-freq: at least one low-freq term MUST occur in result doc
  – High-freq: SHOULD occur in doc → their presence adds to the score
● Executes query, but hands over term statistics
→ no 2nd round of term lookups necessary!
● Also supports MinShouldMatch
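The categorization step can be sketched in isolation. This is an illustration under assumptions, not the CommonTermsQuery source: the class `CommonTermsSketch` is hypothetical, the cutoff is taken as a fraction of the number of docs in the index, and the index size used in the example below is invented.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

final class CommonTermsSketch {
    final List<String> lowFreq = new ArrayList<>();   // at least one of these MUST match
    final List<String> highFreq = new ArrayList<>();  // these SHOULD match, score only

    /**
     * Splits query terms by document frequency.
     *
     * @param docFreqs         term -> docFreq, looked up once per term
     * @param maxDoc           number of docs in the index
     * @param maxTermFrequency fraction of maxDoc above which a term is high-freq
     */
    static CommonTermsSketch categorize(Map<String, Integer> docFreqs,
                                        int maxDoc, float maxTermFrequency) {
        CommonTermsSketch out = new CommonTermsSketch();
        float cutoff = maxTermFrequency * maxDoc;
        for (Map.Entry<String, Integer> e : docFreqs.entrySet()) {
            (e.getValue() > cutoff ? out.highFreq : out.lowFreq).add(e.getKey());
        }
        return out;
    }
}
```

Using the synthetic query from the MinShouldMatch slide (ref: 3.8M, the other four terms: 32k each) with an assumed index of 10M docs and a cutoff of 0.1, only "ref" ends up in the high-freq group. Because the statistics are already in hand after this step, they can be passed on to the executing query instead of being looked up again.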
Cost-Model (≥ 4.3) (Lucene-4607)
● What about structured queries? E.g. +(a b) +c
● Currently: worst-case estimate of returned #docs (docfreq)
  – Disjunctions: Σ_{c∈C} df_c
  – Conjunctions: min_{c∈C} df_c
● Limitations:
  – Effort to generate returned docs?
  – Only one cost (next() vs. advance())
● Open Question:
  – Can we do better with more detailed cost models?
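The two estimates compose recursively over a structured query. A minimal sketch, with the hypothetical names `CostSketch`, `Node`, `term`, `or` and `and` (this is not Lucene's API), shows how the example query +(a b) +c is costed:

```java
final class CostSketch {

    /** A query node that can report a worst-case number of returned docs. */
    interface Node {
        long cost();
    }

    /** Leaf: the cost of a term is its document frequency. */
    static Node term(long docFreq) {
        return () -> docFreq;
    }

    /** Disjunction: in the worst case no docs overlap, so costs add up. */
    static Node or(Node... clauses) {
        return () -> {
            long sum = 0;
            for (Node c : clauses) sum += c.cost();
            return sum;
        };
    }

    /** Conjunction: cannot return more docs than its cheapest clause. */
    static Node and(Node... clauses) {
        return () -> {
            long min = Long.MAX_VALUE;
            for (Node c : clauses) min = Math.min(min, c.cost());
            return min;
        };
    }
}
```

With made-up frequencies df_a = 1000, df_b = 500 and df_c = 200, `and(or(term(1000), term(500)), term(200)).cost()` evaluates to min(1000 + 500, 200) = 200, so c would be chosen as the lead clause. The single number says nothing about how expensive next() is compared to advance(), which is the limitation noted above.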
Maxscore Top-k Scoring Algorithm¹ (Lucene-4100)
● Experimental prototype code attached to Lucene-4100
● Limitation:
  – Requires final run over whole index (i.e. only for static indexes)
¹ H. Turtle, J. Flood. Query Evaluation: Strategies and Optimizations, IPM, 31(6), 1995.
Index Sorting (≥ 4.3) (Lucene-4752)
● Advantages (if appropriate sort order chosen)
  – Better compression → more locality → faster processing
  – Early termination
● Use together with EarlyTerminatingSortingCollector
  – Can terminate scoring within sorted segments
  – Fully scores as-yet unsorted segments
→ see 2nd half of Shai & Adrian's talk yesterday for details
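The early-termination idea can be sketched without Lucene. The sketch assumes that results are ranked by the same static value the segments are sorted by, so the first k matches of a sorted segment are already its best k. `EarlyTerminationSketch`, `Segment` and `Result` are hypothetical names and do not mirror the collector's real interface.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

final class EarlyTerminationSketch {

    static final class Segment {
        final boolean sorted;   // true if matches arrive in descending rank order
        final float[] ranks;    // rank of each matching doc, in index order
        Segment(boolean sorted, float... ranks) {
            this.sorted = sorted;
            this.ranks = ranks;
        }
    }

    static final class Result {
        final List<Float> topK;  // best k ranks, highest first
        final int visited;       // number of docs actually looked at
        Result(List<Float> topK, int visited) {
            this.topK = topK;
            this.visited = visited;
        }
    }

    static Result topK(List<Segment> segments, int k) {
        PriorityQueue<Float> heap = new PriorityQueue<>();  // min-heap of the best k so far
        int visited = 0;
        for (Segment segment : segments) {
            int seen = 0;
            for (float rank : segment.ranks) {
                if (segment.sorted && seen == k) {
                    break;            // rest of a sorted segment cannot beat its first k
                }
                seen++;
                visited++;
                heap.add(rank);
                if (heap.size() > k) heap.poll();   // drop the current worst
            }
        }
        List<Float> top = new ArrayList<>(heap);
        top.sort(Comparator.reverseOrder());
        return new Result(top, visited);
    }
}
```

For a sorted segment with ranks 9, 8, 7, 6, 5 and an as-yet unsorted segment with ranks 1, 10, 2, asking for k = 2 visits only 5 of the 8 docs and still returns the ranks 10 and 9.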
Parallelization
● In general, sharding is better:
  – Shared-nothing
  – Better use cores for handling load
● Multi-threaded query execution:
  – Static indexes: for slow queries, almost perfect speedups (if docs are uniformly distributed over shards)
  – Dynamic indexes: Lucene-2840, Lucene-5299
Summary
● Understand your problem
● Scoring can become an issue with many million docs
● Many recent efficiency improvements
● More to come... patches welcome
We're Hiring @HERE
Frankfurt, Berlin, Boston, Chicago.
Come work with us. Get in touch!
developer.here.com/geocoder
Thank You!
Contact
Email: [email protected]
LinkedIn: http://linkedin.com/in/stefanpohl
Twitter: @pohlstefan
developer.here.com/geocoder