What’s New in Solr

Solr 4.7 & 4.8

June 12, 2014

Search | Discover | Analyze

Speaker

Steve Rowe

• Software Engineer at LucidWorks
• Lucene/Solr committer and PMC member
• Previously worked on search and NLP at the Center for Natural Language Processing at Syracuse University’s iSchool
• Twitter: @steven_a_rowe

Agenda

• A short history of Solr 4
• Solr 4.7 and 4.8: new features
• Solr 4.9 and beyond


A short history of Solr 4

• Solr 4.0 released October 2012

• SolrCloud
  – Distributed indexing and searching, NRT and NoSQL features, e.g. realtime-get, optimistic concurrency and durable updates
  – Sharding, replication, ZooKeeper ensemble
  – High availability with no single points of failure
• Real-time Get: Access latest document version, no commit or new searcher open required
• Atomic updates: incremental field add/update/increment via stored fields
• NRT: “soft” commits
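
To make the atomic update and optimistic concurrency bullets concrete, here is a minimal Python sketch that only builds the JSON body of an update request; it does not contact Solr. The helper name `atomic_update` and the field names (`price`, `popularity`, `tags`) are hypothetical. The `set` / `add` / `inc` modifiers and the `_version_` field reflect the atomic update syntax as commonly documented for Solr 4, so treat them as assumptions to verify against the Reference Guide.

```python
import json


def atomic_update(doc_id, version=None, **modifiers):
    """Build one atomic-update document (hypothetical helper).

    Each keyword argument maps a field name to an (operation, value)
    pair, where the operation is "set", "add" or "inc".
    """
    doc = {"id": doc_id}
    for field, (operation, value) in modifiers.items():
        if operation not in ("set", "add", "inc"):
            raise ValueError("unsupported modifier: " + operation)
        doc[field] = {operation: value}
    if version is not None:
        # Assumption: supplying _version_ asks Solr to apply the update
        # only if the stored document version matches this value.
        doc["_version_"] = version
    return doc


# Body for a JSON update request; field names are made up for the example.
payload = json.dumps([
    atomic_update(
        "book1",
        version=12345,
        price=("set", 9.99),
        popularity=("inc", 1),
        tags=("add", "sale"),
    )
])
```

Only the listed fields change; the remaining fields are rebuilt from stored values, which is why the slide says the feature works "via stored fields".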

• Solr Reference Guide now released with each feature release:
  – Live (targeting next Solr release): http://s.apache.org/SolrReferenceGuide
  – Most recent released PDF: http://s.apache.org/Solr-Ref-Guide-PDF
  – Previous release PDFs: http://s.apache.org/Older-Solr-Ref-Guide-PDFs




• Flexible indexing
  – Solr core = Lucene index
    • Lucene index = 1 or more segments
  – Codec: per-segment suite of formats
• Flexible scoring
  – You can specify similarity implementation per fieldType in your schema.xml if you use SchemaSimilarityFactory
  – Built-in Similarities (other than the default TF-IDF):
    • Okapi BM25
    • Divergence from Randomness
    • Information-Based
    • Language Models (with two smoothing implementations)
    • SweetSpot

• DocValues: typed column stride fields
  – Document-to-value mapping built at index time
  – Reduced memory usage compared to field cache
  – Good for faceting and sorting
  – Missing values now supported as of Solr 4.5
• Pseudo-fields
  – Field aliasing, e.g. &fl=result:indexed
  – Function queries, aliasable too, e.g. &fl=price:sum(a,b)
  – Document transformers
    • Standard: [explain], [value], [shard], [docid]
• Pseudo-joins, e.g. ?q={!join+from=manu+to=id}ipod
• Pivot faceting: automatic drill-down (no distributed support)

• Schema API
  • GET /collection/schema/fields/fieldname
  • PUT /collection/schema/fields/name
    • JSON body: { "type":"text_general", "stored":true, "indexed":true }
• Schemaless mode
  • a.k.a. data-driven schema or field guessing
  • Class guessed based on field values, then class(es) mapped to a fieldType; first gets added to the schema
  • Supported value classes: Boolean, Integer, Long, Float, Double, and Date
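
The Schema API PUT above can be sketched with Python's standard library. This is a minimal illustration, not a client implementation: the host and port in `SOLR_BASE`, the collection name `collection1`, the field name `summary`, and the helper `add_field_request` are all assumptions made for the example. The request object is only constructed, never sent.

```python
import json
import urllib.request

SOLR_BASE = "http://localhost:8983/solr"  # assumed host and port


def add_field_request(collection, name, field_type, stored=True, indexed=True):
    """Build (but do not send) a PUT /collection/schema/fields/name request."""
    url = "{}/{}/schema/fields/{}".format(SOLR_BASE, collection, name)
    body = json.dumps({"type": field_type, "stored": stored, "indexed": indexed})
    return urllib.request.Request(
        url,
        data=body.encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


# Hypothetical collection and field names.
request = add_field_request("collection1", "summary", "text_general")
# urllib.request.urlopen(request) would submit it to a running Solr.
```

Keeping construction separate from sending makes the URL and JSON body easy to inspect before anything touches the schema.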

• Document routing
  – CompositeId router, e.g. id=tenant!docid
    • Used by default when numShards specified when creating a collection.
    • Restrict queries to shard(s): &_route_=tenant!
  – Implicit router
• Online shard splitting
  – Allows collections to scale, rather than having to decide on how much to overshard up front.
  – Split in two; with custom hash ranges; or using split.key param to split to a dedicated shard
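
The compositeId convention above is simple string composition, which the following Python sketch illustrates. The helper names `composite_id` and `route_params` and the tenant and document values are hypothetical; only the `!` separator and the `_route_` parameter come from the slide.

```python
from urllib.parse import parse_qs, urlencode


def composite_id(*components):
    """Join routing components and the document key with '!'."""
    for part in components:
        if "!" in part:
            raise ValueError("components must not contain '!': " + part)
    return "!".join(components)


def route_params(query, *prefix_components):
    """Build query parameters that restrict a search to the shard(s)
    owning the given routing prefix, e.g. _route_=tenant!"""
    route = "!".join(prefix_components) + "!"
    return urlencode({"q": query, "_route_": route})


doc_id = composite_id("tenant", "doc42")   # "tenant!doc42"
params = route_params("*:*", "tenant")     # URL-encoded q and _route_
decoded = parse_qs(params)                 # decode again for inspection
```

Because the shard is derived from a hash of the prefix, documents sharing a prefix land together, which is what lets `_route_` narrow a query to a subset of shards.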

• Nested documents, a.k.a. Block Join
  – Nested doc to be added:

<add>
  <doc>
    <field name="id">1</field>
    <field name="title">Solr adds block join support</field>
    <field name="content_type">parentDocument</field>
    <doc>
      <field name="id">2</field>
      <field name="comments">SolrCloud supports it too!</field>
    </doc>
  </doc>
</add>

  – Queries:
    • Child query parser, e.g. q={!child of="content_type:parentDocument"}title:Solr
    • Parent query parser, e.g. q={!parent which="content_type:parentDocument"}comments:SolrCloud
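
For readers who generate such update messages programmatically, here is a small Python sketch that rebuilds the nested document above with `xml.etree.ElementTree`. The helper `doc_element` is hypothetical; the field names and values are the ones from the slide.

```python
import xml.etree.ElementTree as ET


def doc_element(fields, children=()):
    """Build a <doc> element with <field> entries and nested child docs."""
    doc = ET.Element("doc")
    for name, value in fields.items():
        field = ET.SubElement(doc, "field", name=name)
        field.text = str(value)
    for child in children:
        # Child documents are nested directly inside the parent <doc>.
        doc.append(child)
    return doc


child = doc_element({"id": 2, "comments": "SolrCloud supports it too!"})
parent = doc_element(
    {
        "id": 1,
        "title": "Solr adds block join support",
        "content_type": "parentDocument",
    },
    children=[child],
)

add = ET.Element("add")
add.append(parent)
xml_body = ET.tostring(add, encoding="unicode")

# Query string taken from the slide: find parents whose children match.
parent_query = '{!parent which="content_type:parentDocument"}comments:SolrCloud'
```

The parent and its children must be sent together in one block, since the block join parsers rely on them being indexed adjacently.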

• solr.xml legacy & discovery modes
  – Legacy mode (cores listed in solr.xml) is deprecated; support will be removed in Solr 5.
  – Discovery mode (new as of Solr 4.3):
    • No cores are listed in solr.xml
    • Cores are discovered by a recursive walk of the solr home directory, marked by core.properties files
    • Nested core directories are not allowed
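
The discovery walk described above can be mimicked in a few lines of Python. This is a conceptual sketch, not Solr's implementation: `discover_cores` is a hypothetical function, and the assumption that the walk stops descending once a core.properties file is found is inferred from the "nested core directories are not allowed" bullet.

```python
import os
import tempfile


def discover_cores(solr_home):
    """Return directories under solr_home that contain core.properties."""
    cores = []
    for dirpath, dirnames, filenames in os.walk(solr_home):
        if "core.properties" in filenames:
            cores.append(dirpath)
            # Prune the walk: anything below a core directory is ignored,
            # so a nested core.properties is never picked up.
            dirnames[:] = []
    return sorted(cores)


# Build a throwaway solr home with two cores and one (ignored) nested core.
with tempfile.TemporaryDirectory() as solr_home:
    layout = ("core1", os.path.join("group", "core2"), os.path.join("core1", "nested"))
    for relative in layout:
        core_dir = os.path.join(solr_home, relative)
        os.makedirs(core_dir, exist_ok=True)
        with open(os.path.join(core_dir, "core.properties"), "w") as handle:
            handle.write("name=" + os.path.basename(relative) + "\n")
    found = [os.path.relpath(path, solr_home) for path in discover_cores(solr_home)]
```

In this layout `core1` and `group/core2` are discovered, while `core1/nested` is skipped because the walk never descends into an already discovered core.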



• New web admin UI with SolrCloud support


Solr 4.7 and 4.8: new features

• As of Solr 4.8, Java 7 is the minimum supported JVM version. Recommended: Oracle 1.7.0_60

• <fields> and <types> tags are no longer necessary in schema.xml

• Collections API improvements
  – Working toward “ZooKeeper = Truth” mode
    • legacyCloud=false cluster property
  – New actions:
    • CLUSTERSTATUS, LIST, ADDROLE, DELETEROLE, ADDREPLICA, DELETEREPLICA, OVERSEERSTATUS, MIGRATE, CLUSTERPROP

– Core properties can be specified with CREATE and SPLITSHARD actions

• Asynchronous execution of long-running actions
  – SolrCloud Collections API:
    • CREATE, SPLITSHARD, MIGRATE
  – CoreAdminHandler:
    • CREATE, RENAME, UNLOAD, SWAP, MERGEINDEXES, SPLIT
  – Tracking request ID supplied via async param
  – Track status via the new REQUESTSTATUS action, using the tracking request ID
    • Possible states: running, complete, failed, notfound
  – Clear stored statuses with special request ID -1
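
The submit-then-poll pattern above might look like the following Python sketch. Several things here are assumptions: the base URL in `COLLECTIONS_API`, the `requestid` parameter name for REQUESTSTATUS, the collection and shard names, and the helper names. The status lookup is passed in as a function so the sketch runs without a Solr instance; the state names are those listed on the slide.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

COLLECTIONS_API = "http://localhost:8983/solr/admin/collections"  # assumed


def async_action_url(action, request_id, **params):
    """URL that submits a long-running action with a tracking request ID."""
    query = {"action": action, "async": request_id}
    query.update(params)
    return COLLECTIONS_API + "?" + urlencode(query)


def request_status_url(request_id):
    """URL that asks for the state of a previously submitted request.

    Assumption: the ID is passed in a parameter named 'requestid'.
    """
    query = {"action": "REQUESTSTATUS", "requestid": request_id}
    return COLLECTIONS_API + "?" + urlencode(query)


def wait_for(request_id, fetch_state, max_polls=10):
    """Poll until the state is no longer 'running' or polls run out."""
    for _ in range(max_polls):
        state = fetch_state(request_id)
        if state != "running":
            return state
    return "running"


# Stub standing in for an HTTP call to request_status_url(...).
_states = iter(["running", "running", "complete"])
final_state = wait_for("1000", lambda request_id: next(_states))

submit_url = async_action_url("SPLITSHARD", "1000", collection="colln", shard="shard1")
status_url = request_status_url("1000")
submit_query = parse_qs(urlsplit(submit_url).query)
```

A real client would sleep between polls and treat `failed` and `notfound` as errors; both are omitted here to keep the sketch short.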

• Cursors: Efficient Deep Paging
  – Request must include a sort, which must include the uniqueKey, which must be defined
  – First page: ?q=…&sort=id+asc&rows=N&cursorMark=*
    • Response contains "nextCursorMark":"<base64encoded>"
  – Following pages: ?q=…&sort=id+asc&rows=N&cursorMark=<from response>
  – Repeat; when nextCursorMark=cursorMark from the request, there are no more results
  – No server-side state
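
The paging loop above can be sketched in Python as follows. To keep it runnable without a server, `make_fake_solr` is a hypothetical in-memory stand-in whose cursor marks are plain offsets rather than Solr's opaque encoded values, and whose response is flattened to two keys. Only the control flow of `fetch_all` (start at `*`, follow `nextCursorMark`, stop when it repeats) is meant to carry over.

```python
def fetch_all(fetch_page, rows):
    """Collect every document by following cursor marks."""
    cursor = "*"  # the first request always starts with cursorMark=*
    docs = []
    while True:
        response = fetch_page(cursor_mark=cursor, rows=rows)
        docs.extend(response["docs"])
        next_cursor = response["nextCursorMark"]
        if next_cursor == cursor:
            # Same mark as the one we sent: no more results.
            break
        cursor = next_cursor
    return docs


def make_fake_solr(ids):
    """In-memory stand-in for a query sorted by 'id asc' (illustration only)."""
    ordered = sorted(ids)

    def fetch_page(cursor_mark, rows):
        start = 0 if cursor_mark == "*" else int(cursor_mark)
        page = ordered[start:start + rows]
        # Echo the incoming mark back when the page is empty, which is
        # the end-of-results signal the client loop relies on.
        next_mark = cursor_mark if not page else str(start + len(page))
        return {"docs": page, "nextCursorMark": next_mark}

    return fetch_page


all_ids = fetch_all(make_fake_solr(["d", "a", "c", "b", "e"]), rows=2)
```

The client holds the only paging state (the latest mark), which matches the "no server-side state" bullet.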

• Document expiration and Time To Live (TTL)
  – Auto-delete expired documents
    • DocExpirationUpdateProcessorFactory can periodically wake up and delete expired documents
  – Compute expiration date from TTL
    • Update request _ttl_ param, or
    • Document _ttl_ field
    • Both names are configurable, defaulting to _ttl_.
    • _ttl_ values are interpreted as Date Math Expressions relative to NOW, e.g. “+1YEAR”.
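
To illustrate what "relative to NOW" means for a TTL value, here is a deliberately simplified Python sketch. It is not Solr's date math parser: the function `expiration_from_ttl` is hypothetical and accepts only expressions of the form `+N` followed by SECOND(S), MINUTE(S), HOUR(S) or DAY(S). Calendar units such as YEAR, rounding, and chained expressions are left out.

```python
import re
from datetime import datetime, timedelta, timezone

# Simplified subset of date math units, mapped to timedelta keywords.
_UNITS = {"SECOND": "seconds", "MINUTE": "minutes", "HOUR": "hours", "DAY": "days"}
_TTL_PATTERN = re.compile(r"^\+(\d+)(SECOND|MINUTE|HOUR|DAY)S?$")


def expiration_from_ttl(ttl, now):
    """Compute an expiration timestamp as NOW plus a simple TTL offset."""
    match = _TTL_PATTERN.match(ttl)
    if match is None:
        raise ValueError("unsupported TTL expression in this sketch: " + ttl)
    amount = int(match.group(1))
    unit = match.group(2)
    return now + timedelta(**{_UNITS[unit]: amount})


now = datetime(2014, 6, 12, tzinfo=timezone.utc)
expires = expiration_from_ttl("+7DAYS", now)
```

In Solr the computed date is stored on the document, and the periodic delete removes documents whose expiration date has passed.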

• Dynamic synonyms and stopwords
  – “Managed” resources: configuration and content for synonyms and stopwords, persistence managed by Solr
  – Specified as ManagedSynonymFilterFactory and ManagedStopFilterFactory on analyzers in schema.xml
  – CRUD operations are enabled via a REST endpoint per managed resource.
  – The “managed” attribute names the REST endpoint, e.g. <filter class="solr.ManagedStopFilterFactory" managed="french" />
  – E.g. to delete stopword “le” from the “french” managed stoplist: curl -X DELETE "…/solr/colln/schema/analysis/stopwords/french/le"

• SSL support in SolrCloud
  – URL scheme stored in ZooKeeper
  – SSL certificates are specifiable via system properties, to enable authentication
• Nested documents may be specified in JSON format
• Tri-level compositeId routing
  – E.g. “tenant!group!docid”, 8/8/16 hash bits per component
• Build Solr indexes with Hadoop’s MapReduce
  – Mark Miller’s blog: http://bit.ly/1oh0fWq
  – GitHub solr-map-reduce-example: http://bit.ly/1pnDAao
• Named config sets in non-SolrCloud mode
  – Default base directory is SOLR_HOME/configsets/

• Suggester v2
  – Added BlendedInfixSuggester
  – Added FreeTextSuggester
  – Queries can use multiple suggesters
• New query parsing features
  – SimpleQParserPlugin: parser for human-entered queries with selectable operators.
  – ComplexPhraseQParserPlugin: wildcards, ORs, etc. inside Phrase Queries
    • E.g. {!complexphrase inOrder=true}name:"Jo* Smith"

• CollapsingQParserPlugin
  – Performant alternative grouping/field collapsing implementation, for high distinct group cardinality.
• ExpandComponent
  – Expands collapsed groups
  – Can also expand nested documents
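
The relationship between collapsing and expanding can be shown with a small in-memory Python sketch. It is a conceptual illustration only, not how Solr implements either feature: the functions `collapse` and `expand`, the `group` field, and the sample documents are all hypothetical, and the sketch assumes the highest-scoring document represents each group.

```python
def collapse(docs, field):
    """Keep only the highest-scoring document for each value of `field`."""
    best = {}
    for doc in docs:
        key = doc[field]
        if key not in best or doc["score"] > best[key]["score"]:
            best[key] = doc
    return sorted(best.values(), key=lambda d: d["score"], reverse=True)


def expand(docs, field, heads):
    """For each collapsed group head, list the other members of its group."""
    head_ids = {head["id"] for head in heads}
    groups = {head[field]: [] for head in heads}
    for doc in docs:
        if doc[field] in groups and doc["id"] not in head_ids:
            groups[doc[field]].append(doc)
    return groups


# Made-up result set: two groups, one of them with three members.
results = [
    {"id": "1", "group": "a", "score": 1.0},
    {"id": "2", "group": "a", "score": 3.0},
    {"id": "3", "group": "b", "score": 2.0},
    {"id": "4", "group": "a", "score": 0.5},
]
heads = collapse(results, "group")
expanded = expand(results, "group", heads)
```

The collapsed list is what a user sees as the main result, and the expanded map supplies the "more from this group" entries for the groups on the current page.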


Solr 4.9 and beyond

• ZooKeeper = Truth / legacyCloud=false

• MODIFYCOLLECTION collections API
  – Modify maxShardsPerNode, replicationFactor for the entire collection
• Incremental Field Updates on numeric DocValues
  – Binary DocValues IFUs also coming
• Multi-valued DocValues sort fields
• Legacy numeric/date field types deprecated, removed in Solr 5 in favor of Trie field types

• In Solr 5, the .war will no longer be shipped
• Index integrity: checksums
  • Integrity check on merge off by default
  • solrconfig.xml option <indexConfig><checkIntegrityAtMerge>
• New update query param min_rf will allow clients to set the minimum successful replicas for the request
• Return Block Join child documents when parents match, via a new DocTransformer [child parentFilter="field:value"]

• AnalyticsQuery: support pluggable, pipeline-able analytics, orderable via the “cost” parameter, like PostFilters.
• ReRankingQParserPlugin
  • Re-rank the top n results
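
Re-ranking the top n results can be pictured with the short Python sketch below. It is a conceptual model under stated assumptions, not the plugin's code: the function `rerank`, the `weight` parameter, and the additive combination of the original score with a weighted re-rank score are choices made for this example.

```python
def rerank(results, rerank_score, top_n, weight=2.0):
    """Re-order only the first top_n results using a second scoring function.

    results is a list of (doc_id, score) pairs, already sorted by score.
    Assumption: the new score is the original score plus
    weight * rerank_score(doc_id).
    """
    head = [
        (doc_id, score + weight * rerank_score(doc_id))
        for doc_id, score in results[:top_n]
    ]
    head.sort(key=lambda pair: pair[1], reverse=True)
    # Results beyond top_n keep their original order and scores.
    return head + results[top_n:]


# Made-up first-pass results and a toy second-pass scorer.
first_pass = [("a", 3.0), ("b", 2.0), ("c", 1.0), ("d", 0.5)]
reranked = rerank(first_pass, lambda doc_id: 1.0 if doc_id == "b" else 0.0, top_n=2)
reranked_ids = [doc_id for doc_id, _ in reranked]
```

The appeal of the approach is cost: the expensive second scorer runs on only the top n documents instead of the whole result set.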


Platform

LucidWorks Open Source

• Effortless AWS deployment and monitoring: http://www.github.com/lucidworks/solr-scale-tk

• Logstash for Solr: https://github.com/LucidWorks/solrlogmanager

• Banana (Kibana for Solr): https://github.com/LucidWorks/banana

• Data Quality Toolkit: https://github.com/LucidWorks/data-quality

• Coming Soon for Big Data: Hadoop, Pig, Hive 2-way support w/ Lucene and Solr, different file formats, pipelines, Logstash


Links

Solr website: http://lucene.apache.org/solr

Solr Reference Guide:

• Live (targeting next Solr release): http://s.apache.org/SolrReferenceGuide

• Most recent released PDF: http://s.apache.org/Solr-Ref-Guide-PDF

• Previous release PDFs: http://s.apache.org/Older-Solr-Ref-Guide-PDFs

Lucene/Solr Revolution: http://www.LuceneRevolution.org

Q & A

