What's New in Solr, June 2014
1
What’s New in Solr
Solr 4.7 & 4.8
June 12, 2014
Search | Discover | Analyze
2
Speaker
Steve Rowe
• Software Engineer at LucidWorks
• Lucene/Solr committer and PMC member
• Previously worked on search and NLP at the Center for Natural Language Processing at Syracuse University’s iSchool
• Twitter: @steven_a_rowe
3
Agenda
• A short history of Solr 4
• Solr 4.7 and 4.8: new features
• Solr 4.9 and beyond
4
A short history of Solr 4
• Solr 4.0 released October 2012
5
A short history of Solr 4
• SolrCloud
  – Distributed indexing and searching, NRT and NoSQL features, e.g. realtime-get, optimistic concurrency and durable updates
  – Sharding, replication, ZooKeeper ensemble
  – High availability with no single points of failure
• Real-time Get: Access latest document version, no commit or new searcher open required
• Atomic updates: incremental field add/update/increment via stored fields
• NRT: “soft” commits
6
A short history of Solr 4
• Solr Reference Guide now released with each feature release:
  – Live (targeting next Solr release): http://s.apache.org/SolrReferenceGuide
  – Most recent released PDF: http://s.apache.org/Solr-Ref-Guide-PDF
  – Previous release PDFs: http://s.apache.org/Older-Solr-Ref-Guide-PDFs
7
A short history of Solr 4
• Flexible indexing
  – Solr core = Lucene index
    • Lucene index = 1 or more segments
  – Codec: per-segment suite of formats
• Flexible scoring
  – You can specify a similarity implementation per fieldType in your schema.xml if you use SchemaSimilarityFactory
  – Built-in Similarities (other than the default TF-IDF):
    • Okapi BM25
    • Divergence from Randomness
    • Information-Based
    • Language Models (with two smoothing implementations)
    • SweetSpot
8
A short history of Solr 4
• DocValues: typed column stride fields
  – Document-to-value mapping built at index time
  – Reduced memory usage compared to field cache
  – Good for faceting and sorting
  – Missing values supported as of Solr 4.5
• Pseudo-fields
  – Field aliasing, e.g. &fl=result:indexed
  – Function queries, aliasable too, e.g. &fl=price:sum(a,b)
  – Document transformers
    • Standard: [explain], [value], [shard], [docid]
• Pseudo-joins, e.g. ?q={!join+from=manu+to=id}ipod
• Pivot faceting: automatic drill-down (no distributed support)
9
A short history of Solr 4
• Schema API
  – GET /collection/schema/fields/fieldname
  – PUT /collection/schema/fields/name
    • JSON body: { "type":"text_general", "stored":true, "indexed":true }
• Schemaless mode
  – a.k.a. data-driven schema or field guessing
  – Class guessed based on field values, then class(es) mapped to a fieldType; first gets added to the schema
  – Supported value classes: Boolean, Integer, Long, Float, Double, and Date
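The PUT call on the Schema API slide can be sketched in Python as follows. This is only an illustration: the base URL, collection name, and field name are hypothetical, the helper `build_add_field_request` is not part of Solr, and the request is built but never sent. It also assumes the schema is managed and mutable.

```python
import json
import urllib.request


def build_add_field_request(base_url, collection, field_name, field_type,
                            stored=True, indexed=True):
    """Build (but do not send) a PUT request that adds one field to the schema."""
    # Path shape taken from the slide: /collection/schema/fields/name
    url = f"{base_url}/{collection}/schema/fields/{field_name}"
    body = json.dumps({"type": field_type, "stored": stored, "indexed": indexed})
    return urllib.request.Request(
        url,
        data=body.encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


# Hypothetical host, collection, and field name.
req = build_add_field_request(
    "http://localhost:8983/solr", "collection1", "title", "text_general"
)
print(req.get_method(), req.full_url)
print(req.data.decode("utf-8"))
```

Passing `req` to `urllib.request.urlopen` would send it to a running Solr instance.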
10
A short history of Solr 4
• Document routing
  – CompositeId router, e.g. id=tenant!docid
    • Used by default when numShards is specified when creating a collection.
    • Restrict queries to shard(s): &_route_=tenant!
  – Implicit router
• Online shard splitting
  – Allows collections to scale, rather than having to decide on how much to overshard up front.
  – Split in two; with custom hash ranges; or using split.key param to split to a dedicated shard
11
A short history of Solr 4
• Nested documents, a.k.a. Block Join
  – Nested doc to be added:
      <add>
        <doc>
          <field name="id">1</field>
          <field name="title">Solr adds block join support</field>
          <field name="content_type">parentDocument</field>
          <doc>
            <field name="id">2</field>
            <field name="comments">SolrCloud supports it too!</field>
          </doc>
        </doc>
      </add>
  – Queries:
    • Child query parser, e.g. q={!child of="content_type:parentDocument"}title:Solr
    • Parent query parser, e.g. q={!parent which="content_type:parentDocument"}comments:SolrCloud
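As a rough sketch, the nested `<add>` message and the two block join queries from this slide could be assembled in Python like this. The helpers `make_doc` and `make_add` are illustrative names, not a Solr client API; field names and values are the ones on the slide.

```python
import xml.etree.ElementTree as ET
from urllib.parse import parse_qs, urlencode


def make_doc(fields, children=()):
    """Create a <doc> element; child <doc> elements are nested inside the parent."""
    doc = ET.Element("doc")
    for name, value in fields.items():
        field = ET.SubElement(doc, "field", name=name)
        field.text = str(value)
    for child in children:
        doc.append(child)
    return doc


def make_add(docs):
    """Wrap documents in an <add> update message and serialize it."""
    add = ET.Element("add")
    for doc in docs:
        add.append(doc)
    return ET.tostring(add, encoding="unicode")


child = make_doc({"id": "2", "comments": "SolrCloud supports it too!"})
parent = make_doc(
    {"id": "1", "title": "Solr adds block join support",
     "content_type": "parentDocument"},
    children=[child],
)
add_xml = make_add([parent])

# URL-encoded query strings for the two block join parsers.
parent_q = '{!parent which="content_type:parentDocument"}comments:SolrCloud'
child_q = '{!child of="content_type:parentDocument"}title:Solr'
parent_query = urlencode({"q": parent_q})
child_query = urlencode({"q": child_q})
print(add_xml)
print(parent_query)
```

The `which` and `of` filters must match all parent documents, which is why the parent carries a `content_type` marker field.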
12
A short history of Solr 4
• solr.xml legacy & discovery modes
  – Legacy mode (cores listed in solr.xml) is deprecated; support will be removed in Solr 5.
  – Discovery mode (new as of Solr 4.3):
    • No cores are listed in solr.xml
    • Cores are discovered by a recursive walk of the solr home directory, marked by core.properties files
    • Nested core directories are not allowed
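The discovery walk described above can be imitated with a few lines of Python. This is a sketch of the idea only, not Solr's implementation: `discover_cores` is a made-up helper, and the directory layout is created in a temporary folder for the demonstration.

```python
import os
import tempfile


def discover_cores(solr_home):
    """Return directories under solr_home that contain a core.properties file."""
    cores = []
    for dirpath, dirnames, filenames in os.walk(solr_home):
        if "core.properties" in filenames:
            cores.append(dirpath)
            # A core directory is not searched further, so nested cores are ignored.
            dirnames[:] = []
    return sorted(cores)


with tempfile.TemporaryDirectory() as home:
    # Two valid cores plus one nested core directory that should be skipped.
    for rel in ("core1", os.path.join("group", "core2"), os.path.join("core1", "nested")):
        os.makedirs(os.path.join(home, rel))
        open(os.path.join(home, rel, "core.properties"), "w").close()
    found = [os.path.relpath(path, home) for path in discover_cores(home)]

print(found)
```

Clearing `dirnames` in place is what stops `os.walk` from descending below a directory already identified as a core.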
13
A short history of Solr 4
• New web admin UI with SolrCloud support
14
Solr 4.7 and 4.8: new features
• As of Solr 4.8, Java 7 is the minimum supported JVM version. Recommended: Oracle 1.7.0_60
• <fields> and <types> tags are no longer necessary in schema.xml
• Collections API improvements
  – Working toward “ZooKeeper = Truth” mode
    • legacyCloud=false cluster property
  – New actions:
    • CLUSTERSTATUS, LIST, ADDROLE, DELETEROLE, ADDREPLICA, DELETEREPLICA, OVERSEERSTATUS, MIGRATE, CLUSTERPROP
  – Core properties can be specified with CREATE and SPLITSHARD actions
15
Solr 4.7 and 4.8: new features
• Asynchronous execution of long-running actions
  – SolrCloud Collections API:
    • CREATE, SPLITSHARD, MIGRATE
  – CoreAdminHandler:
    • CREATE, RENAME, UNLOAD, SWAP, MERGEINDEXES, SPLIT
  – Tracking request ID supplied via async param
  – Track status via the new REQUESTSTATUS action, using the tracking request ID
    • Possible states: running, complete, failed, notfound
  – Clear stored statuses with special request ID -1
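A minimal sketch of the submit-then-poll pattern follows. The base URL, collection, shard, and request ID are hypothetical, and `fetch_state` stands in for whatever HTTP call extracts the state from a REQUESTSTATUS response; here it is faked so the example runs without a cluster. State names are the ones listed on the slide.

```python
import time
from urllib.parse import urlencode


def collections_url(base_url, params):
    """Build a Collections API URL from a dict of request parameters."""
    return f"{base_url}/admin/collections?{urlencode(params)}"


def wait_for_request(fetch_state, request_id, poll_seconds=1.0, max_polls=60):
    """Poll until the tracked request leaves the 'running' state."""
    for _ in range(max_polls):
        state = fetch_state(request_id)
        if state != "running":
            return state  # complete, failed, or notfound
        time.sleep(poll_seconds)
    raise TimeoutError(f"request {request_id} still running after {max_polls} polls")


base = "http://localhost:8983/solr"  # hypothetical host
submit_url = collections_url(base, {
    "action": "SPLITSHARD", "collection": "collection1",
    "shard": "shard1", "async": "1000",
})
status_url = collections_url(base, {"action": "REQUESTSTATUS", "requestid": "1000"})
clear_url = collections_url(base, {"action": "REQUESTSTATUS", "requestid": "-1"})

# Fake status source standing in for real HTTP calls to status_url.
_states = iter(["running", "running", "complete"])
final_state = wait_for_request(lambda request_id: next(_states), "1000", poll_seconds=0)
print(submit_url)
print(final_state)
```

The parameters are passed as a dict rather than keyword arguments because `async` is a reserved word in Python.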
16
Solr 4.7 and 4.8: new features
• Cursors: Efficient Deep Paging
  – Request must include a sort, which must include the uniqueKey, which must be defined
  – First page: ?q=…&sort=id+asc&rows=N&cursorMark=*
    • Response contains "nextCursorMark":"<base64encoded>"
  – Following pages: ?q=…&sort=id+asc&rows=N&cursorMark=<from response>
  – Repeat; when nextCursorMark=cursorMark from the request, there are no more results
  – No server-side state
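The paging loop above can be sketched in Python as follows. `fetch_page` is a placeholder for the HTTP call to Solr, and `fake_fetch` is an in-memory stand-in invented for this example; real cursor marks are opaque strings, not the numeric offsets the stand-in uses.

```python
def iterate_with_cursor(fetch_page, query, sort="id asc", rows=100):
    """Yield every matching document by following nextCursorMark until it stops changing."""
    cursor = "*"  # first page
    while True:
        params = {"q": query, "sort": sort, "rows": rows, "cursorMark": cursor}
        response = fetch_page(params)
        yield from response["response"]["docs"]
        next_cursor = response["nextCursorMark"]
        if next_cursor == cursor:
            return  # same mark as the request: no more results
        cursor = next_cursor


# In-memory stand-in for Solr, already sorted by id.
DOCS = [{"id": doc_id} for doc_id in ("a", "b", "c", "d", "e")]


def fake_fetch(params):
    start = 0 if params["cursorMark"] == "*" else int(params["cursorMark"])
    page = DOCS[start:start + params["rows"]]
    # When nothing is left, echo the request's mark back, as the slide describes.
    next_mark = str(start + len(page)) if page else params["cursorMark"]
    return {"response": {"docs": page}, "nextCursorMark": next_mark}


ids = [doc["id"] for doc in iterate_with_cursor(fake_fetch, "*:*", rows=2)]
print(ids)
```

Because the client carries the cursor mark, the loop can pause and resume without any state held by the server.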
18
Solr 4.7 and 4.8: new features
• Document expiration and Time To Live (TTL)
  – Auto-delete expired documents
    • DocExpirationUpdateProcessorFactory can periodically wake up and delete expired documents
  – Compute expiration date from TTL
    • Update request _ttl_ param, or
    • Document _ttl_ field
    • Both names are configurable, defaulting to _ttl_.
    • _ttl_ values are interpreted as Date Math Expressions relative to NOW, e.g. “+1YEAR”.
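To illustrate how a TTL turns into an expiration date, here is a deliberately small Python sketch. It handles only a subset of Date Math (`+N` followed by SECOND(S), MINUTE(S), HOUR(S), or DAY(S)); calendar units such as YEAR and MONTH, and rounding, are left out on purpose. The function name is made up and this is not how Solr parses the value internally.

```python
import re
from datetime import datetime, timedelta, timezone

# Fixed-length units only; YEAR and MONTH need calendar arithmetic and are omitted.
_UNITS = {"SECOND": "seconds", "MINUTE": "minutes", "HOUR": "hours", "DAY": "days"}
_TTL_PATTERN = re.compile(r"^\+(\d+)(SECOND|MINUTE|HOUR|DAY)S?$")


def expiration_from_ttl(ttl, now):
    """Return now + ttl for the supported subset of Date Math expressions."""
    match = _TTL_PATTERN.match(ttl)
    if match is None:
        raise ValueError(f"unsupported TTL expression in this sketch: {ttl!r}")
    amount = int(match.group(1))
    unit = match.group(2)
    return now + timedelta(**{_UNITS[unit]: amount})


now = datetime(2014, 6, 12, tzinfo=timezone.utc)
doc = {"id": "1", "_ttl_": "+30DAYS"}  # hypothetical document
expires_at = expiration_from_ttl(doc["_ttl_"], now)
print(expires_at.isoformat())
```

A periodic delete, like the one DocExpirationUpdateProcessorFactory performs, would then remove documents whose computed expiration lies in the past.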
19
Solr 4.7 and 4.8: new features
• Dynamic synonyms and stopwords
  – “Managed” resources: configuration and content for synonyms and stopwords, persistence managed by Solr
  – Specified as ManagedSynonymFilterFactory and ManagedStopFilterFactory on analyzers in schema.xml
  – CRUD operations are enabled via a REST endpoint per managed resource.
  – The “managed” attribute names the REST endpoint, e.g. <filter class="solr.ManagedStopFilterFactory" managed="french" />
  – E.g. to delete stopword “le” from the “french” managed stoplist: curl -X DELETE "…/solr/colln/schema/analysis/stopwords/french/le"
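The curl call above has a straightforward Python equivalent, sketched here without sending anything. The host in `base_url` is a hypothetical placeholder for the part the slide elides, and `stopword_delete_request` is an illustrative helper rather than a Solr client function.

```python
import urllib.request
from urllib.parse import quote


def stopword_delete_request(base_url, collection, managed_name, word):
    """Build (but do not send) a DELETE for one word in a managed stopword list."""
    url = (
        f"{base_url}/solr/{collection}/schema/analysis/stopwords/"
        f"{quote(managed_name, safe='')}/{quote(word, safe='')}"
    )
    return urllib.request.Request(url, method="DELETE")


# Hypothetical host; collection, list name, and word are the ones on the slide.
delete_req = stopword_delete_request("http://localhost:8983", "colln", "french", "le")
print(delete_req.get_method(), delete_req.full_url)
```

The managed list name and the word are percent-encoded so that non-ASCII stopwords still form a valid URL.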
20
Solr 4.7 and 4.8: new features
• SSL support in SolrCloud
  – URL scheme stored in ZooKeeper
  – SSL certificates are specifiable via system properties, to enable authentication
• Nested documents may be specified in JSON format
• Tri-level compositeId routing
  – E.g. “tenant!group!docid”, 8/8/16 hash bits per component
• Build Solr indexes with Hadoop’s MapReduce
  – Mark Miller’s blog: http://bit.ly/1oh0fWq
  – GitHub solr-map-reduce-example: http://bit.ly/1pnDAao
• Named config sets in non-SolrCloud mode
  – Default base directory is SOLR_HOME/configsets/
21
Solr 4.7 and 4.8: new features
• Suggester v2
  – Added BlendedInfixSuggester
  – Added FreeTextSuggester
  – Queries can use multiple suggesters
• New query parsing features
  – SimpleQParserPlugin: parser for human-entered queries with selectable operators.
  – ComplexPhraseQParserPlugin: wildcards, ORs, etc. inside Phrase Queries
    • E.g. {!complexphrase inOrder=true}name:"Jo* Smith"
22
Solr 4.7 and 4.8: new features
• CollapsingQParserPlugin
  – Performant alternative grouping/field collapsing implementation, for high distinct group cardinality.
• ExpandComponent
  – Expands collapsed groups
  – Can also expand nested documents
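A request that combines the two features might be built like this. The sketch only assembles parameters: the field name `manu` and the query are hypothetical, and `collapse_expand_params` is an illustrative helper.

```python
from urllib.parse import urlencode


def collapse_expand_params(query, field, expand_rows=None):
    """Collapse results on one field and ask ExpandComponent for the group members."""
    params = {
        "q": query,
        "fq": f"{{!collapse field={field}}}",  # CollapsingQParserPlugin as a filter query
        "expand": "true",                       # turn on ExpandComponent
    }
    if expand_rows is not None:
        params["expand.rows"] = expand_rows     # documents returned per expanded group
    return params


collapse_params = collapse_expand_params("ipod", "manu", expand_rows=5)
collapse_query_string = urlencode(collapse_params)
print(collapse_params["fq"])
print(collapse_query_string)
```

The collapse filter keeps one document per distinct field value in the main result list, and the expanded section of the response carries the other members of each group.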
23
Solr 4.9 and beyond
• ZooKeeper = Truth / legacyCloud=false
• MODIFYCOLLECTION collections API
  – Modify maxShardsPerNode, replicationFactor for the entire collection
• Incremental Field Updates on numeric DocValues
  – Binary DocValues IFUs also coming
• Multi-valued DocValues sort fields
• Legacy numeric/date field types deprecated, removed in Solr 5 in favor of Trie field types
24
Solr 4.9 and beyond
• In Solr 5, the .war will no longer be shipped
• Index integrity: checksums
  – Integrity check on merge off by default
  – solrconfig.xml option <indexConfig><checkIntegrityAtMerge>
• New update query param min_rf will allow clients to set the minimum successful replicas for the request
• Return Block Join child documents when parents match, via a new DocTransformer [child parentFilter="field:value"]
25
Solr 4.9 and beyond
• AnalyticsQuery: support pluggable, pipeline-able analytics, orderable via the “cost” parameter, like PostFilters.
• ReRankingQParserPlugin
  – Re-rank the top n results
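As a sketch of what a re-ranking request could look like, the Python below assembles an `rq` parameter in the style of the ReRankingQParserPlugin. The parameter names (`reRankQuery`, `reRankDocs`, `reRankWeight`) follow the syntax as released in Solr 4.9, which is an assumption relative to this slide; the queries themselves are hypothetical.

```python
from urllib.parse import urlencode


def rerank_params(main_query, rerank_query, rerank_docs=200, rerank_weight=2.0):
    """Re-score the top rerank_docs results of main_query using rerank_query."""
    rq = (
        f"{{!rerank reRankQuery=$rqq reRankDocs={rerank_docs} "
        f"reRankWeight={rerank_weight}}}"
    )
    # $rqq dereferences the separate rqq parameter holding the re-rank query.
    return {"q": main_query, "rq": rq, "rqq": rerank_query}


rerank_request = rerank_params("ipod", "title:solr", rerank_docs=1000, rerank_weight=3)
rerank_query_string = urlencode(rerank_request)
print(rerank_request["rq"])
print(rerank_query_string)
```

Only the top `reRankDocs` results of the main query are re-scored, so a more expensive query can be used for the second pass without running it over the whole result set.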
26
Platform
LucidWorks Open Source
• Effortless AWS deployment and monitoring: http://www.github.com/lucidworks/solr-scale-tk
• Logstash for Solr: https://github.com/LucidWorks/solrlogmanager
• Banana (Kibana for Solr): https://github.com/LucidWorks/banana
• Data Quality Toolkit: https://github.com/LucidWorks/data-quality
• Coming Soon for Big Data: Hadoop, Pig, Hive 2-way support w/ Lucene and Solr, different file formats, pipelines, Logstash
27
Links
Solr website: http://lucene.apache.org/solr
Solr Reference Guide:
• Live (targeting next Solr release): http://s.apache.org/SolrReferenceGuide
• Most recent released PDF: http://s.apache.org/Solr-Ref-Guide-PDF
• Previous release PDFs: http://s.apache.org/Older-Solr-Ref-Guide-PDFs
Lucene/Solr Revolution: http://www.LuceneRevolution.org
Q & A