What’s New in Solr: Solr 4.7 & 4.8

June 12, 2014

Search | Discover | Analyze


Page 2: What's new in solr june 2014

Speaker

Steve Rowe

• Software Engineer at LucidWorks
• Lucene/Solr committer and PMC member
• Previously worked on search and NLP at the Center for Natural Language Processing at Syracuse University’s iSchool
• Twitter: @steven_a_rowe

Page 3: What's new in solr june 2014

Agenda

• A short history of Solr 4
• Solr 4.7 and 4.8: new features
• Solr 4.9 and beyond

Page 4: What's new in solr june 2014


A short history of Solr 4

• Solr 4.0 released October 2012

Page 5: What's new in solr june 2014

A short history of Solr 4

• SolrCloud
  – Distributed indexing and searching, NRT and NoSQL features, e.g. realtime-get, optimistic concurrency and durable updates
  – Sharding, replication, ZooKeeper ensemble
  – High availability with no single points of failure
• Real-time Get: access the latest document version; no commit or new searcher open required
• Atomic updates: incremental field add/update/increment via stored fields
• NRT: “soft” commits
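As a rough illustration of the atomic update and real-time get bullets above, the sketch below builds an atomic-update JSON payload and a real-time get URL without sending anything. The host, the `books` collection, and the field names are assumptions made up for the example; `set`, `add` and `inc` are the atomic update modifiers, and `/get` is the real-time get handler.

```python
import json
from urllib.parse import urlencode


def atomic_update(doc_id, set_fields=None, add_fields=None, inc_fields=None):
    """Build one atomic-update document: each changed field maps to {modifier: value}."""
    doc = {"id": doc_id}
    for modifier, fields in (("set", set_fields), ("add", add_fields), ("inc", inc_fields)):
        for name, value in (fields or {}).items():
            doc[name] = {modifier: value}
    return doc


def realtime_get_url(base_url, collection, doc_id):
    """URL for fetching the latest version of a document via real-time get."""
    return f"{base_url}/{collection}/get?{urlencode({'id': doc_id})}"


# Hypothetical document and collection, for illustration only.
update = atomic_update("book1", set_fields={"title": "Solr 4"}, inc_fields={"copies": 1})
payload = json.dumps([update])  # body to POST to the collection's /update handler
url = realtime_get_url("http://localhost:8983/solr", "books", "book1")
```

Because only the changed fields are sent, the rest of the document is rebuilt from stored fields on the server side, which is why the slide ties atomic updates to stored fields.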

Page 6: What's new in solr june 2014

A short history of Solr 4

• Solr Reference Guide now released with each feature release:
  – Live (targeting next Solr release): http://s.apache.org/SolrReferenceGuide
  – Most recent released PDF: http://s.apache.org/Solr-Ref-Guide-PDF
  – Previous release PDFs: http://s.apache.org/Older-Solr-Ref-Guide-PDFs

Page 7: What's new in solr june 2014

A short history of Solr 4

• Flexible indexing
  – Solr core = Lucene index
    • Lucene index = 1 or more segments
  – Codec: per-segment suite of formats
• Flexible scoring
  – You can specify a similarity implementation per fieldType in your schema.xml if you use SchemaSimilarityFactory
  – Built-in Similarities (other than the default TF-IDF):
    • Okapi BM25
    • Divergence from Randomness
    • Information-Based
    • Language Models (with two smoothing implementations)
    • SweetSpot

Page 8: What's new in solr june 2014

A short history of Solr 4

• DocValues: typed column stride fields
  – Document-to-value mapping built at index time
  – Reduced memory usage compared to field cache
  – Good for faceting and sorting
  – Missing values now supported as of Solr 4.5
• Pseudo-fields
  – Field aliasing, e.g. &fl=result:indexed
  – Function queries, aliasable too, e.g. &fl=price:sum(a,b)
  – Document transformers
    • Standard: [explain], [value], [shard], [docid]
• Pseudo-joins, e.g. ?q={!join+from=manu+to=id}ipod
• Pivot faceting: automatic drill-down (no distributed support)

Page 9: What's new in solr june 2014

A short history of Solr 4

• Schema API
  – GET /collection/schema/fields/fieldname
  – PUT /collection/schema/fields/name
    • JSON body: { "type":"text_general", "stored":true, "indexed":true }
• Schemaless mode
  – a.k.a. data-driven schema or field guessing
  – Class guessed based on field values, then class(es) mapped to a fieldType; first gets added to the schema
  – Supported value classes: Boolean, Integer, Long, Float, Double, and Date
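The field-guessing idea in schemaless mode can be sketched as a chain of parse attempts. This is a simplified stand-in, not Solr's implementation: the function name, the order of the attempts, the single accepted date format, and the `String` fallback are all assumptions, and the sketch does not distinguish Float from Double.

```python
from datetime import datetime


def guess_value_class(value):
    """Guess a value class for a raw string by trying progressively looser parsers."""
    text = value.strip()
    if text.lower() in ("true", "false"):
        return "Boolean"
    try:
        number = int(text)
        # Values outside the signed 32-bit range are treated as Long.
        return "Integer" if -2**31 <= number < 2**31 else "Long"
    except ValueError:
        pass
    try:
        float(text)
        return "Double"
    except ValueError:
        pass
    try:
        # Only one ISO-8601 layout is accepted in this sketch.
        datetime.strptime(text, "%Y-%m-%dT%H:%M:%SZ")
        return "Date"
    except ValueError:
        return "String"  # fallback when nothing else matches (assumption)


guesses = {v: guess_value_class(v) for v in ("true", "42", "3.14", "2014-06-12T00:00:00Z")}
```

The guessed class would then be mapped to a fieldType configured for that class before the new field is added to the schema.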

Page 10: What's new in solr june 2014

A short history of Solr 4

• Document routing
  – CompositeId router, e.g. id=tenant!docid
    • Used by default when numShards is specified when creating a collection.
    • Restrict queries to shard(s): &_route_=tenant!
  – Implicit router
• Online shard splitting
  – Allows collections to scale, rather than having to decide on how much to overshard up front.
  – Split in two; with custom hash ranges; or using split.key param to split to a dedicated shard

Page 11: What's new in solr june 2014

A short history of Solr 4

• Nested documents, a.k.a. Block Join
  – Nested doc to be added:

<add>
  <doc>
    <field name="id">1</field>
    <field name="title">Solr adds block join support</field>
    <field name="content_type">parentDocument</field>
    <doc>
      <field name="id">2</field>
      <field name="comments">SolrCloud supports it too!</field>
    </doc>
  </doc>
</add>

  – Queries:
    • Child query parser, e.g. q={!child of="content_type:parentDocument"}title:Solr
    • Parent query parser, e.g. q={!parent which="content_type:parentDocument"}comments:SolrCloud
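One way to generate the nested add message from the slide programmatically is with Python's standard `xml.etree.ElementTree`. The helper name `make_doc` is hypothetical; the field names and values are the ones from the slide.

```python
import xml.etree.ElementTree as ET


def make_doc(fields, children=()):
    """Build a <doc> element with one <field> per entry, then any nested child docs."""
    doc = ET.Element("doc")
    for name, value in fields.items():
        field = ET.SubElement(doc, "field", name=name)
        field.text = str(value)
    for child in children:
        doc.append(child)
    return doc


child = make_doc({"id": 2, "comments": "SolrCloud supports it too!"})
parent = make_doc(
    {"id": 1, "title": "Solr adds block join support", "content_type": "parentDocument"},
    children=[child],
)
add = ET.Element("add")
add.append(parent)
xml_body = ET.tostring(add, encoding="unicode")  # body for an XML update request
```

The parent carries a marker field (`content_type` here) so that the `of` and `which` filters in the child and parent query parsers can identify parent documents.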

Page 12: What's new in solr june 2014

A short history of Solr 4

• solr.xml legacy & discovery modes
  – Legacy mode (cores listed in solr.xml) is deprecated; support will be removed in Solr 5.
  – Discovery mode (new as of Solr 4.3):
    • No cores are listed in solr.xml
    • Cores are discovered by a recursive walk of the solr home directory, marked by core.properties files
    • Nested core directories are not allowed
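The discovery walk described above can be approximated in a few lines. This is a sketch of the idea rather than Solr's code: the function name is hypothetical, and it models "nested core directories are not allowed" by simply not descending below a directory that contains core.properties.

```python
import os


def discover_cores(solr_home):
    """Return directories under solr_home that contain a core.properties file."""
    cores = []
    for dirpath, dirnames, filenames in os.walk(solr_home):
        if "core.properties" in filenames:
            cores.append(dirpath)
            # Stop descending: anything below a core directory is ignored.
            dirnames[:] = []
    return sorted(cores)
```

Calling `discover_cores("/path/to/solr/home")` on a hypothetical home directory returns one entry per core, regardless of how deeply the core directories are nested under intermediate folders.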

Page 13: What's new in solr june 2014


A short history of Solr 4

• New web admin UI with SolrCloud support

Page 14: What's new in solr june 2014

Solr 4.7 and 4.8: new features

• As of Solr 4.8, Java 7 is the minimum supported JVM version. Recommended: Oracle 1.7.0_60
• <fields> and <types> tags are no longer necessary in schema.xml
• Collections API improvements
  – Working toward “ZooKeeper = Truth” mode
    • legacyCloud=false cluster property
  – New actions:
    • CLUSTERSTATUS, LIST, ADDROLE, DELETEROLE, ADDREPLICA, DELETEREPLICA, OVERSEERSTATUS, MIGRATE, CLUSTERPROP
  – Core properties can be specified with CREATE and SPLITSHARD actions

Page 15: What's new in solr june 2014

Solr 4.7 and 4.8: new features

• Asynchronous execution of long-running actions
  – SolrCloud Collections API:
    • CREATE, SPLITSHARD, MIGRATE
  – CoreAdminHandler:
    • CREATE, RENAME, UNLOAD, SWAP, MERGEINDEXES, SPLIT
  – Tracking request ID supplied via async param
  – Track status via the new REQUESTSTATUS action, using the tracking request ID
    • Possible states: running, complete, failed, notfound
  – Clear stored statuses with special request ID -1
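A minimal sketch of the submit-then-poll pattern for asynchronous actions follows. The host, collection, shard, and request ID are made up; `fetch_state` is a hypothetical callable that would issue the REQUESTSTATUS request and extract the state string from the response, which the slide does not show. The `requestid` parameter name comes from the Solr Reference Guide rather than from the slide.

```python
import time
from urllib.parse import urlencode


def collections_api_url(base_url, params):
    """Build a Collections API URL from a dict of request parameters."""
    return f"{base_url}/admin/collections?{urlencode(params)}"


def wait_for_request(fetch_state, request_id, poll_seconds=1.0, max_polls=60):
    """Poll until the tracked request leaves the 'running' state or polls run out."""
    for _ in range(max_polls):
        state = fetch_state(request_id)
        if state != "running":
            return state  # any other state is treated as terminal
        time.sleep(poll_seconds)
    return "running"


base = "http://localhost:8983/solr"  # assumed host
submit_url = collections_api_url(
    base, {"action": "SPLITSHARD", "collection": "colln", "shard": "shard1", "async": "1000"}
)
status_url = collections_api_url(base, {"action": "REQUESTSTATUS", "requestid": "1000"})
```

Treating every state other than `running` as terminal keeps the loop independent of the exact spelling of the success and failure states.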

Page 16: What's new in solr june 2014

Solr 4.7 and 4.8: new features

• Cursors: Efficient Deep Paging
  – Request must include a sort, which must include the uniqueKey, which must be defined
  – First page: ?q=…&sort=id+asc&rows=N&cursorMark=*
    • Response contains "nextCursorMark":"<base64encoded>"
  – Following pages: ?q=…&sort=id+asc&rows=N&cursorMark=<from response>
  – Repeat; when nextCursorMark=cursorMark from the request, there are no more results
  – No server-side state
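The paging loop above can be sketched as a generator. To keep the example self-contained, `fetch_page` is a hypothetical callable standing in for the HTTP request, and it returns a simplified dict with `docs` and `nextCursorMark` keys; `make_fake_fetch` is an in-memory stand-in for Solr that uses the last ID seen as the cursor mark instead of a base64 token.

```python
def iterate_with_cursor(fetch_page, query, sort="id asc", rows=100):
    """Yield every matching document by following nextCursorMark until it stops changing."""
    cursor_mark = "*"  # first page
    while True:
        response = fetch_page(
            {"q": query, "sort": sort, "rows": rows, "cursorMark": cursor_mark}
        )
        yield from response["docs"]
        next_mark = response["nextCursorMark"]
        if next_mark == cursor_mark:
            return  # same mark as in the request: no more results
        cursor_mark = next_mark


def make_fake_fetch(ids):
    """In-memory stand-in for Solr: the cursor mark is just the last ID returned."""
    ordered = sorted(ids)

    def fetch_page(params):
        mark = params["cursorMark"]
        remaining = ordered if mark == "*" else [i for i in ordered if i > mark]
        page = remaining[: params["rows"]]
        next_mark = page[-1] if page else mark
        return {"docs": [{"id": i} for i in page], "nextCursorMark": next_mark}

    return fetch_page


all_ids = [d["id"] for d in iterate_with_cursor(make_fake_fetch("edcba"), "*:*", rows=2)]
```

All paging state lives in the cursor mark held by the client, which matches the "no server-side state" point on the slide.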

Page 17: What's new in solr june 2014


Solr 4.7 and 4.8: new features

Page 18: What's new in solr june 2014

Solr 4.7 and 4.8: new features

• Document expiration and Time To Live (TTL)
  – Auto-delete expired documents
    • DocExpirationUpdateProcessorFactory can periodically wake up and delete expired documents
  – Compute expiration date from TTL
    • Update request _ttl_ param, or
    • Document _ttl_ field
    • Both names are configurable, defaulting to _ttl_.
    • _ttl_ values are interpreted as Date Math Expressions relative to NOW, e.g. “+1YEAR”.
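To make the TTL-to-expiration step concrete, here is a deliberately small sketch that handles only expressions of the form `+<N><UNIT>` for a handful of units. It is not Solr's Date Math parser: the function names, the supported unit list, and the Feb 29 handling are assumptions of this example.

```python
import re
from datetime import datetime, timedelta, timezone

_FIXED_UNITS = {
    "SECOND": timedelta(seconds=1),
    "MINUTE": timedelta(minutes=1),
    "HOUR": timedelta(hours=1),
    "DAY": timedelta(days=1),
}
_TTL_PATTERN = re.compile(r"^\+(\d+)(SECOND|MINUTE|HOUR|DAY|YEAR)S?$")


def expiration_from_ttl(now, ttl):
    """Compute an expiration timestamp from a '+<N><UNIT>' TTL relative to now."""
    match = _TTL_PATTERN.match(ttl)
    if match is None:
        raise ValueError(f"unsupported TTL expression: {ttl!r}")
    amount, unit = int(match.group(1)), match.group(2)
    if unit == "YEAR":
        try:
            return now.replace(year=now.year + amount)
        except ValueError:
            # Feb 29 with a non-leap target year: fall back to Feb 28.
            return now.replace(year=now.year + amount, day=28)
    return now + amount * _FIXED_UNITS[unit]


def is_expired(expiration, now):
    """A document is considered expired once now reaches its expiration time."""
    return now >= expiration


now = datetime(2014, 6, 12, tzinfo=timezone.utc)
expires = expiration_from_ttl(now, "+1YEAR")
```

A periodic deletion pass would then remove every document whose stored expiration time satisfies `is_expired`.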

Page 19: What's new in solr june 2014

Solr 4.7 and 4.8: new features

• Dynamic synonyms and stopwords
  – “Managed” resources: configuration and content for synonyms and stopwords, persistence managed by Solr
  – Specified as ManagedSynonymFilterFactory and ManagedStopFilterFactory on analyzers in schema.xml
  – CRUD operations are enabled via a REST endpoint per managed resource.
  – The “managed” attribute names the REST endpoint, e.g.
    <filter class="solr.ManagedStopFilterFactory" managed="french" />
  – E.g. to delete stopword “le” from the “french” managed stoplist:
    curl -X DELETE "…/solr/colln/schema/analysis/stopwords/french/le"
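The curl example above translates to Python's standard `urllib` roughly as follows. The request is only constructed, not sent. The base URL is an assumption, since the slide elides the host with "…"; the helper names are hypothetical.

```python
from urllib.parse import quote
from urllib.request import Request


def stopword_url(base_url, collection, list_name, word=None):
    """URL of a managed stopword list, or of one word within it."""
    url = f"{base_url}/{collection}/schema/analysis/stopwords/{quote(list_name)}"
    return url if word is None else f"{url}/{quote(word)}"


def delete_stopword_request(base_url, collection, list_name, word):
    """Build (but do not send) the DELETE request for one stopword."""
    return Request(stopword_url(base_url, collection, list_name, word), method="DELETE")


# Assumed host; collection, list and word are the ones from the slide.
request = delete_stopword_request("http://localhost:8983/solr", "colln", "french", "le")
# urllib.request.urlopen(request) would actually issue the call.
```

The list name in the URL is the value of the `managed` attribute on the filter, so each managed filter gets its own endpoint.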

Page 20: What's new in solr june 2014

Solr 4.7 and 4.8: new features

• SSL support in SolrCloud
  – URL scheme stored in ZooKeeper
  – SSL certificates are specifiable via system properties, to enable authentication
• Nested documents may be specified in JSON format
• Tri-level compositeId routing
  – E.g. “tenant!group!docid”, 8/8/16 hash bits per component
• Build Solr indexes with Hadoop’s MapReduce
  – Mark Miller’s blog: http://bit.ly/1oh0fWq
  – GitHub solr-map-reduce-example: http://bit.ly/1pnDAao
• Named config sets in non-SolrCloud mode
  – Default base directory is SOLR_HOME/configsets/
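For the tri-level compositeId bullet above, one reading of "8/8/16 hash bits per component" is that the routing hash takes its top 8 bits from the tenant, the next 8 from the group, and the low 16 from the document ID. The sketch below illustrates that reading only: `zlib.crc32` is a stand-in hash chosen because it ships with Python (Solr uses MurmurHash3), and the function names are hypothetical.

```python
import zlib

# Bit masks for the three components: 8 bits, 8 bits, 16 bits.
MASKS = (0xFF000000, 0x00FF0000, 0x0000FFFF)


def stand_in_hash(text):
    """32-bit stand-in hash (NOT the hash function Solr uses)."""
    return zlib.crc32(text.encode("utf-8")) & 0xFFFFFFFF


def composite_hash(doc_id, hash_fn=stand_in_hash):
    """Combine the hashes of 'tenant!group!docid' parts, one bit range per part."""
    parts = doc_id.split("!")
    if len(parts) != 3:
        raise ValueError("expected an ID of the form tenant!group!docid")
    result = 0
    for part, mask in zip(parts, MASKS):
        result |= hash_fn(part) & mask
    return result


h1 = composite_hash("tenant!group1!doc1")
h2 = composite_hash("tenant!group2!doc2")
same_tenant_prefix = (h1 ^ h2) & MASKS[0] == 0
```

Under this scheme, documents of one tenant share the top 8 bits and therefore fall into a contiguous slice of the hash range, and documents sharing tenant and group share the top 16 bits.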

Page 21: What's new in solr june 2014

Solr 4.7 and 4.8: new features

• Suggester v2
  – Added BlendedInfixSuggester
  – Added FreeTextSuggester
  – Queries can use multiple suggesters
• New query parsing features
  – SimpleQParserPlugin: parser for human-entered queries with selectable operators.
  – ComplexPhraseQParserPlugin: wildcards, ORs, etc. inside Phrase Queries
    • E.g. {!complexphrase inOrder=true}name:"Jo* Smith"

Page 22: What's new in solr june 2014

Solr 4.7 and 4.8: new features

• CollapsingQParserPlugin
  – Performant alternative grouping/field collapsing implementation, for high distinct group cardinality.
• ExpandComponent
  – Expands collapsed groups
  – Can also expand nested documents
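A collapse-and-expand request combines a `{!collapse}` filter query with the `expand` parameter. The sketch below only assembles the query string; the query term and the `manu` field are placeholders borrowed from the join example earlier in the deck, and the helper name is hypothetical.

```python
from urllib.parse import urlencode


def collapse_expand_params(query, field, expand=True):
    """Request parameters that collapse results on one field and optionally expand groups."""
    params = {"q": query, "fq": f"{{!collapse field={field}}}"}
    if expand:
        params["expand"] = "true"
    return params


params = collapse_expand_params("ipod", "manu")
query_string = urlencode(params)  # ready to append to a select URL
```

Collapsing keeps one representative document per field value in the main result list, and the expand section of the response carries the other members of each collapsed group.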

Page 23: What's new in solr june 2014

Solr 4.9 and beyond

• ZooKeeper = Truth / legacyCloud=false
• MODIFYCOLLECTION collections API
  – Modify maxShardsPerNode, replicationFactor for the entire collection
• Incremental Field Updates on numeric DocValues
  – Binary DocValues IFUs also coming
• Multi-valued DocValues sort fields
• Legacy numeric/date field types deprecated, removed in Solr 5 in favor of Trie field types

Page 24: What's new in solr june 2014

Solr 4.9 and beyond

• In Solr 5, the .war will no longer be shipped
• Index integrity: checksums
  – Integrity check on merge off by default
  – solrconfig.xml option <indexConfig><checkIntegrityAtMerge>
• New update query param min_rf will allow clients to set the minimum successful replicas for the request
• Return Block Join child documents when parents match, via a new DocTransformer [child parentFilter="field:value"]

Page 25: What's new in solr june 2014

Solr 4.9 and beyond

• AnalyticsQuery: support pluggable, pipeline-able analytics, orderable via the “cost” parameter, like PostFilters.
• ReRankingQParserPlugin
  – Re-rank the top n results

Page 26: What's new in solr june 2014


Platform

LucidWorks Open Source

• Effortless AWS deployment and monitoring: http://www.github.com/lucidworks/solr-scale-tk

• Logstash for Solr: https://github.com/LucidWorks/solrlogmanager

• Banana (Kibana for Solr): https://github.com/LucidWorks/banana

• Data Quality Toolkit: https://github.com/LucidWorks/data-quality

• Coming Soon for Big Data: Hadoop, Pig, Hive 2-way support w/ Lucene and Solr, different file formats, pipelines, Logstash

Page 27: What's new in solr june 2014

Links

Solr website: http://lucene.apache.org/solr

Solr Reference Guide:

• Live (targeting next Solr release): http://s.apache.org/SolrReferenceGuide

• Most recent released PDF: http://s.apache.org/Solr-Ref-Guide-PDF

• Previous release PDFs: http://s.apache.org/Older-Solr-Ref-Guide-PDFs

Lucene/Solr Revolution: http://www.LuceneRevolution.org

Q & A