Solr 3.1 and Beyond
Yonik Seeley
Lucid Imagination
October 8, 2010
Agenda
Goal: Introduce new features you can try & use now in Solr development versions 3.1 or 4.0
Relevancy (Extended Dismax Parser)
Spatial/Geo Search
Search Result Grouping / Field Collapsing
Faceting (Pivot, Range, Per-segment)
Scalability (Solr Cloud)
Odds & Ends
Q&A
04/21/23 2
Solr 3.1? What happened to 1.5?
Lucene/Solr merged (March 2010)
  Single set of committers
  Single dev mailing list ([email protected])
  Single shared subversion trunk
  Keep separate downloads, user mailing lists
  Other former Lucene subprojects spun off (Nutch, Tika, Mahout, etc.)
Development trunk is now always the next major release (currently 4.0)
  branch_3x will be the base for all 3.x releases
  Branch together, release together, share version numbers
RELEVANCE
Extended Dismax Parser
Superset of dismax
&defType=edismax&q=foo&qf=body
Fixes edge cases where dismax could still throw exceptions
  OR AND NOT - "
Full lucene syntax support
  Tries lucene syntax first
  Smart escaping is applied if there are syntax errors
Optionally supports treating “and”/“or” as AND/OR in lucene syntax
Fielded queries (e.g. myfield:foo) even in degraded mode
uf parameter controls what field names may be directly specified in “q”
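A request using these parameters can be assembled with ordinary URL encoding. The sketch below is only illustrative: the query text and the field names `myfield` and `body` are placeholders, and no Solr instance is contacted.

```python
from urllib.parse import urlencode, parse_qs

def edismax_query_string(q, qf, uf=None):
    """Build the query-string part of an edismax request (illustrative only)."""
    params = {"defType": "edismax", "q": q, "qf": qf}
    if uf is not None:
        # uf limits which field names may be used directly inside q
        params["uf"] = uf
    return urlencode(params)

# A fielded clause combined with a plain term; names are placeholders.
query_string = edismax_query_string("myfield:foo OR bar", qf="body", uf="myfield")
print(query_string)

# Decoding the string gives back the original parameter values.
print(parse_qs(query_string))
```

The encoded string would be appended to a select URL of your own Solr instance.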
Extended Dismax Parser (continued)
boost parameter for multiplicative boost-by-function
Pure negative query clauses
  Example: solr OR (-solr)
Enhanced term proximity boosting
  pf2=myfield results in term bigrams in sloppy phrase queries
  myfield:"aa bb cc" -> myfield:"aa bb" myfield:"bb cc"
Enhanced stopword handling
  stopwords omitted in main query, but added in optional proximity boosting part
  Example: q=solr is awesome & qf=myfield & pf2=myfield -> +myfield:(solr awesome) (myfield:"solr is" myfield:"is awesome")
  Currently controlled by the absence of StopWordFilter in index analyzer, and presence in query analyzer
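The bigram and stopword behavior can be imitated in a few lines. This is a toy string rewrite that reproduces the two examples on the slide, not the parser's actual implementation. The function names and the whitespace tokenization are assumptions made for illustration.

```python
def term_bigrams(phrase):
    """Return adjacent term pairs, as pf2 does for a phrase."""
    terms = phrase.split()
    return [f"{a} {b}" for a, b in zip(terms, terms[1:])]

def sketch_rewrite(q, field, stopwords):
    """Toy rewrite: stopwords dropped from the mandatory part,
    kept in the optional bigram proximity part."""
    main_terms = [t for t in q.split() if t not in stopwords]
    proximity = " ".join(f'{field}:"{bigram}"' for bigram in term_bigrams(q))
    return f"+{field}:({' '.join(main_terms)}) ({proximity})"

print(term_bigrams("aa bb cc"))
# ['aa bb', 'bb cc']
print(sketch_rewrite("solr is awesome", "myfield", {"is"}))
# +myfield:(solr awesome) (myfield:"solr is" myfield:"is awesome")
```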
SPATIAL SEARCH
Spatial Search
Step 1: Index some locations!
  <field name="name">The Alpine Shop</field>
  <field name="store">44.013617,-73.168264</field>
Step 2: Decide where you are
  &pt=44.0153371,-73.16734&d=1&sfield=store
Step 3: Profit!
Spatial Filter: &fq={!geofilt}
Bounding Box: &fq={!bbox}
Distance Function: &sort=geodist() asc
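To get a feel for what the filter and the distance function compute, here is a rough sketch using the haversine formula on the two points from the example. It assumes `d` is a radius in kilometers and uses a mean Earth radius of 6371 km. Solr's own distance calculation may differ in detail.

```python
import math

EARTH_RADIUS_KM = 6371.0  # assumed mean Earth radius for this sketch

def parse_point(text):
    """Parse a 'lat,lon' string as used by the pt parameter and the store field."""
    lat, lon = text.split(",")
    return float(lat), float(lon)

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two lat/lon points, in kilometers."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))

store = parse_point("44.013617,-73.168264")   # indexed location
pt = parse_point("44.0153371,-73.16734")      # where you are
dist = haversine_km(*pt, *store)
within_radius = dist <= 1.0                   # d=1

print(f"{dist:.3f} km, inside radius: {within_radius}")
```

The store is roughly 0.2 km from the query point, so it falls inside the d=1 radius.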
RESULT GROUPING / FIELD COLLAPSING
Field Collapsing Definition
Field collapsing
  Limit the number of results per category
  “category” normally defined by unique values in a field
Uses
  Web search: collapse by web site
  Email threads: collapse by thread id
  Ecommerce/retail: show the top 5 items for each store category (music, movies, etc.)
Field Collapsing by Site
Field Collapse on Product Type
Result Grouping by Category
Group by Field
http://...&fl=id,name&q=ipod&group=true&group.field=manu_exact
"grouped":{
  "manu_exact":{
    "matches":3,
    "groups":[{
        "groupValue":"Belkin",
        "doclist":{"numFound":2,"start":0,"docs":[
            { "id":"IW-02",
              "name":"iPod & iPod Mini USB 2.0 Cable"}]
        }},
      { "groupValue":"Apple Computer Inc.",
        "doclist":{"numFound":1,"start":0,"docs":[
            { "id":"MA147LL/A",
              "name":"Apple 60 GB iPod with Video Playback Black"}]
        }}]}}}
Group by Query
http://...&group=true&group.query=price:[0 TO 99.99]&group.query=price:[100 TO *]&group.limit=5

"grouped":{
  "price:[0 TO 99.99]":{
    "matches":3,
    "doclist":{"numFound":2,"start":0,"docs":[
        { "id":"IW-02",
          "name":"iPod & iPod Mini USB 2.0 Cable"},
        { "id":"F8V7067-APL-KIT",
          "name":"Belkin Mobile Power Cord for iPod"}]
    }},
  "price:[100 TO *]":{
    "matches":3,
    "doclist":{"numFound":1,"start":0,"docs":[
        { "id":"MA147LL/A",
          "name":"Apple 60 GB iPod with Video Playback Black"}]
    }}}}
Grouping Params
parameter                         meaning                                                            default
group.field=<field>               Like facet.field: group by unique field values
group.query=<query>               Like facet.query: top docs that also match
group.function=<function query>   Group by unique values produced by the function query
group.limit=<n>                   How many docs per group                                            1
group.sort=<sort spec>            How to sort documents within a group                               Same as “sort” param
rows=<n>                          How many groups to return                                          10
sort=<sort spec>                  How to sort the groups relative to each other (based on top doc)
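The effect of group.field and group.limit can be pictured with a small in-memory sketch. It assumes the documents arrive already ordered by the main sort. The helper name and the dictionary layout are illustrative only and loosely mirror the response format shown earlier.

```python
def group_by_field(docs, field, limit=1):
    """Group already-sorted docs by a field, keeping at most `limit` docs per group."""
    groups = {}
    for doc in docs:
        value = doc[field]
        group = groups.setdefault(value, {"groupValue": value, "numFound": 0, "docs": []})
        group["numFound"] += 1              # every match counts toward numFound
        if len(group["docs"]) < limit:      # but only the top `limit` docs are kept
            group["docs"].append(doc)
    return list(groups.values())

# Sample docs based on the ipod example; manu_exact values follow the response above.
docs = [
    {"id": "IW-02", "manu_exact": "Belkin"},
    {"id": "F8V7067-APL-KIT", "manu_exact": "Belkin"},
    {"id": "MA147LL/A", "manu_exact": "Apple Computer Inc."},
]
groups = group_by_field(docs, "manu_exact", limit=1)
for g in groups:
    print(g["groupValue"], g["numFound"], [d["id"] for d in g["docs"]])
```

With limit=1 the Belkin group reports two matches but returns only its top document, which is the behavior visible in the group-by-field response.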
FACETING
Pivot Faceting
Other names that could have made sense: Grid Faceting, Cross-Product Faceting, Matrix Faceting
Syntax: facet.pivot=field1,field2,field3,…
                    #docs   #docs w/ inStock:true   #docs w/ inStock:false
cat:electronics       14             10                       4
cat:memory             3              3                       0
cat:connector          2              0                       2
cat:graphics card      2              0                       2
cat:hard drive         2              2                       0
facet.pivot=cat,inStock
Pivot Faceting
http://...&facet=true&facet.pivot=cat,popularity

"facet_counts":{
  "facet_pivot":{
    "cat,popularity":[{
        "field":"cat",
        "value":"electronics",
        "count":14,
        "pivot":[{
            "field":"popularity",
            "value":"6",
            "count":5},
          { "field":"popularity",
            "value":"7",
            "count":4},
          { "field":"popularity",
            "value":"1",
            "count":2}]},
      { "field":"cat",
        "value":"memory",
        "count":3,
        "pivot":[]},
      […]

14 docs w/ cat==electronics
5 docs w/ cat==electronics && popularity==6
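Conceptually, a two-level pivot is a count per outer value with a nested count per inner value. The sketch below computes that over a handful of made-up documents with single-valued fields. It only illustrates the shape of the result, not how Solr computes it.

```python
from collections import Counter

def pivot_counts(docs, outer, inner):
    """Count docs per outer value and, within each, per inner value."""
    outer_counts = Counter()
    inner_counts = {}
    for doc in docs:
        outer_value = doc[outer]
        outer_counts[outer_value] += 1
        inner_counts.setdefault(outer_value, Counter())[doc[inner]] += 1
    return [
        {
            "field": outer,
            "value": value,
            "count": count,
            "pivot": [
                {"field": inner, "value": v, "count": n}
                for v, n in inner_counts[value].most_common()
            ],
        }
        for value, count in outer_counts.most_common()
    ]

# Made-up sample data, not taken from the example index.
sample_docs = [
    {"cat": "electronics", "inStock": True},
    {"cat": "electronics", "inStock": True},
    {"cat": "electronics", "inStock": False},
    {"cat": "memory", "inStock": True},
]
pivot = pivot_counts(sample_docs, "cat", "inStock")
print(pivot)
```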
Range Faceting
• Like Date faceting, but more generic
http://...&facet=true&facet.range=price&facet.range.start=0&facet.range.end=500&facet.range.gap=50

"facet_counts":{
  "facet_ranges":{
    "price":{
      "counts":{
        "0.0":5,
        "50.0":2,
        "100.0":0,
        "150.0":2,
        "200.0":0,
        "250.0":1,
        "300.0":2,
        "350.0":2,
        "400.0":0,
        "450.0":1},
      "gap":50.0,
      "start":0.0,
      "end":500.0}}}}
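The bucketing behind facet.range.start, facet.range.end and facet.range.gap can be sketched as follows. The sketch assumes each bucket includes its lower bound and excludes its upper bound, and it keys buckets by their lower bound as the response above does. The sample prices are the three from the ipod examples, not the full index.

```python
def range_facet(values, start, end, gap):
    """Count values into [lower, lower + gap) buckets between start and end."""
    counts = {}
    lower = start
    while lower < end:
        counts[str(float(lower))] = 0       # bucket label is its lower bound
        lower += gap
    for value in values:
        if start <= value < end:
            bucket = start + int((value - start) // gap) * gap
            counts[str(float(bucket))] += 1
    return counts

prices = [11.5, 19.95, 399.0]
counts = range_facet(prices, start=0, end=500, gap=50)
print(counts)
```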
Existing single-valued faceting algorithm
q=Juggernaut&facet=true&facet.field=hero
Lucene FieldCache Entry (StringIndex) for the “hero” field
  order: for each doc, an index into the lookup array
    5 3 5 1 4 5 2 1
  lookup: the string values
    (null) batman flash spiderman superman wolverine
Documents matching the base query “Juggernaut”
  0 2 7
For each matching doc: lookup its order value, increment that slot of the accumulator
accumulator
  0 1 0 0 0 2
Priority queue
  Batman, 3
  flash, 5
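The figure's algorithm can be sketched in a few lines. The arrays below are one reading of the digits in the figure (order = 5 3 5 1 4 5 2 1, matching docs = 0 2 7), and the choice to skip slot 0, the (null) entry, when picking top terms is an assumption of this sketch.

```python
import heapq

# FieldCache-style entry for the "hero" field, as read off the figure.
lookup = [None, "batman", "flash", "spiderman", "superman", "wolverine"]
order = [5, 3, 5, 1, 4, 5, 2, 1]      # per doc: index into lookup
base_docs = [0, 2, 7]                 # docs matching the base query

def accumulate(order, num_terms, docs):
    """One counter per lookup slot; bump the slot of each matching doc."""
    accumulator = [0] * num_terms
    for doc in docs:
        accumulator[order[doc]] += 1
    return accumulator

def top_terms(accumulator, lookup, limit=10):
    """Pick the highest-count terms, skipping slot 0 (docs with no value)."""
    candidates = [(count, lookup[i]) for i, count in enumerate(accumulator)
                  if i != 0 and count > 0]
    return [(term, count) for count, term in heapq.nlargest(limit, candidates)]

accumulator = accumulate(order, len(lookup), base_docs)
print(accumulator)                    # [0, 1, 0, 0, 0, 2]
print(top_terms(accumulator, lookup)) # [('wolverine', 2), ('batman', 1)]
```

The cost is one array lookup and one increment per matching document, which is why this method is fast once the FieldCache entry exists.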
Per-segment single-valued algorithm
[Figure: the base DocSet is split across four segments. Each segment (Segment1 to Segment4) has its own FieldCache entry and its own accumulator (accumulator1 to accumulator4), filled by a lookup + increment pass that can run on its own thread (thread1 to thread4). A FieldCache + accumulator merger (priority queue) combines the per-segment results into the final priority queue, e.g. Batman, 3; flash, 5.]
Per-segment faceting
Enable with facet.method=fcs
Controllable multi-threading
  facet.field={!threads=4}myfield
Disadvantages
  Larger memory use (FieldCaches + accumulators)
  Slower (extra FieldCache merge step needed)
Advantages
  Rebuilds FieldCache entries only for new segments (NRT friendly)
  Multi-threaded
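The extra merge step exists because each segment has its own lookup array, so the same index can mean different terms in different segments. A rough sketch under that assumption follows. The segment data is made up, and Python threads are used only to mirror the structure, not to demonstrate a speedup.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

def count_segment(segment, docs):
    """Count one segment with its own order/lookup arrays; return counts by term."""
    lookup, order = segment["lookup"], segment["order"]
    accumulator = [0] * len(lookup)
    for doc in docs:                   # doc ids are local to the segment
        accumulator[order[doc]] += 1
    return {lookup[i]: n for i, n in enumerate(accumulator) if i != 0 and n > 0}

def facet_per_segment(segments, docs_per_segment, threads=4):
    """Count segments in parallel, then merge by term string, not by index."""
    with ThreadPoolExecutor(max_workers=threads) as pool:
        partials = list(pool.map(count_segment, segments, docs_per_segment))
    merged = Counter()
    for partial in partials:
        merged.update(partial)
    return merged

# Made-up segments: index 1 means "batman" in one and "flash" in the other.
segments = [
    {"lookup": [None, "batman", "flash"], "order": [1, 2, 2]},
    {"lookup": [None, "flash", "superman"], "order": [1, 1, 2]},
]
docs_per_segment = [[0, 2], [0, 1, 2]]
merged = facet_per_segment(segments, docs_per_segment)
print(merged.most_common())
```

When a new segment appears, only that segment's entry needs to be built, which is the NRT-friendly property listed above.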
Per-segment faceting performance comparison
Test index: 10M documents, 18 segments, single valued field

A: Base DocSet=100 docs, facet.field on a field with 100,000 unique terms
Time for request*         facet.method=fc   facet.method=fcs
static index                     3 ms            244 ms
quickly changing index        1388 ms            267 ms

B: Base DocSet=1,000,000 docs, facet.field on a field with 100 unique terms
Time for request*         facet.method=fc   facet.method=fcs
static index                    26 ms             34 ms
quickly changing index         741 ms             94 ms

*complete request time, measured externally
Faceting Performance Improvements
For facet.method=enum, sped up initial population of the filterCache (i.e. the first time a field is faceted on): anywhere from a 30% to a 32x improvement
Optimized facet.method=fc for multi-valued fields and large facet.limit – up to 3x faster
Optimized deep facet paging – up to 10x faster with really large facet.offsets
Less memory consumed by field cache entries
SCALABILITY
SolrCloud
First steps toward simplifying cluster management
Integrates ZooKeeper
  Central configuration (schema.xml, solrconfig.xml, etc.)
  Tracks live nodes + shards of collections
Removes need for external load balancers
  shards=localhost:8983/solr|localhost:8900/solr,localhost:7574/solr|localhost:7500/solr
Can specify logical shard ids
  shards=NY_shard,NJ_shard
Clients don’t need to know shards at all:
  http://localhost:8983/solr/collection1/select?distrib=true
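In the shards value, "|" separates equivalent copies of one shard and "," separates the shards themselves. A small helper can build the string from a nested list. The helper name is made up for illustration.

```python
def shards_param(shards):
    """Join replicas of each shard with '|' and the shards with ','."""
    return ",".join("|".join(replicas) for replicas in shards)

cluster = [
    ["localhost:8983/solr", "localhost:8900/solr"],   # shard 1 and its copy
    ["localhost:7574/solr", "localhost:7500/solr"],   # shard 2 and its copy
]
print("shards=" + shards_param(cluster))
print("shards=" + shards_param([["NY_shard"], ["NJ_shard"]]))
```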
SolrCloud : The Future
Eliminate all single points of failure
Remove Master/Searcher distinction
  Enables near real-time search in a highly scalable environment
High Availability for Writes
  Eventual consistency model (like Amazon Dynamo, Cassandra)
Elastic
  Simply add/subtract servers, cluster will rebalance automatically
  By default, Solr will handle document partitioning
ODDS & ENDS
Auto-Suggest
Many people currently use terms component
  Can be slow for a large corpus
New auto-suggest builds off SpellCheck component
  Compact memory based trie for really fast completions
  Based on a field in the main index, or on a dictionary file
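To show why a trie makes completions fast, here is a plain dictionary-based trie. It is not the compact structure Solr uses. The class and the sample words are illustrative only. Looking up a prefix costs one step per character, after which only the matching subtree is walked.

```python
class Trie:
    """Minimal prefix tree: one node per character."""

    def __init__(self):
        self.children = {}
        self.is_word = False

    def insert(self, word):
        node = self
        for ch in word:
            node = node.children.setdefault(ch, Trie())
        node.is_word = True

    def complete(self, prefix):
        """Return all stored words that start with prefix, sorted."""
        node = self
        for ch in prefix:
            node = node.children.get(ch)
            if node is None:
                return []
        results, stack = [], [(node, prefix)]
        while stack:
            current, text = stack.pop()
            if current.is_word:
                results.append(text)
            for ch, child in current.children.items():
                stack.append((child, text + ch))
        return sorted(results)

trie = Trie()
for word in ["ultrasharp", "usb", "video"]:   # sample dictionary
    trie.insert(word)
print(trie.complete("ult"))   # ['ultrasharp']
```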
http://localhost:8983/solr/suggest?wt=json&indent=true&q=ult
"spellcheck":{
  "suggestions":[
    "ult",{
      "numFound":1,
      "startOffset":0,
      "endOffset":3,
      "suggestion":["ultrasharp"]},
    "collation","ultrasharp"]}}
Index with JSON
$ URL=http://localhost:8983/solr/update/json
$ curl $URL -H 'Content-type:application/json' -d '
{"add": {
  "doc": {
    "id" : "978-0641723445",
    "cat" : ["book","hardcover"],
    "title" : "The Lightning Thief",
    "author" : "Rick Riordan",
    "series_t" : "Percy Jackson and the Olympians",
    "sequence_i" : 1,
    "genre_s" : "fantasy",
    "inStock" : true,
    "price" : 12.50,
    "pages_i" : 384
  }
}}'
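The same request can be prepared from Python's standard library. The sketch below only builds the request object and does not send it. Passing it to `urllib.request.urlopen` would post it, assuming a Solr instance is listening at that URL. The helper name is made up.

```python
import json
import urllib.request

UPDATE_URL = "http://localhost:8983/solr/update/json"

def build_add_request(doc, url=UPDATE_URL):
    """Wrap a document in the {"add": {"doc": ...}} envelope as a POST request."""
    body = json.dumps({"add": {"doc": doc}}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-type": "application/json"},
        method="POST",
    )

book = {
    "id": "978-0641723445",
    "cat": ["book", "hardcover"],
    "title": "The Lightning Thief",
    "author": "Rick Riordan",
    "inStock": True,
    "price": 12.50,
}
request = build_add_request(book)
print(request.get_method(), request.full_url)
print(request.data.decode("utf-8"))
```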
Query Results in CSV
http://localhost:8983/solr/select?q=ipod&fl=name,price,cat,popularity&wt=csv
name,price,cat,popularity
iPod & iPod Mini USB 2.0 Cable,11.5,"electronics,connector",1
Belkin Mobile Power Cord for iPod w/ Dock,19.95,"electronics,connector",1
Apple 60 GB iPod with Video Playback Black,399.0,"electronics,music",10
Can handle multi-valued fields (see “cat” field in example)
Completely compatible with the CSV update handler (can round-trip)
Results are streamed, which is good for dumping entire parts of the index
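On the client side, the output above parses with any standard CSV reader. A minimal sketch follows, assuming multi-valued fields are comma-joined inside one quoted cell as in the sample, and that individual values contain no commas themselves.

```python
import csv
import io

response_text = '''name,price,cat,popularity
iPod & iPod Mini USB 2.0 Cable,11.5,"electronics,connector",1
Belkin Mobile Power Cord for iPod w/ Dock,19.95,"electronics,connector",1
Apple 60 GB iPod with Video Playback Black,399.0,"electronics,music",10
'''

rows = list(csv.DictReader(io.StringIO(response_text)))
for row in rows:
    row["cat"] = row["cat"].split(",")      # split the multi-valued field
    row["price"] = float(row["price"])
    row["popularity"] = int(row["popularity"])

for row in rows:
    print(row["name"], row["price"], row["cat"])
```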
http://localhost:8983/solr/browse
Q&A