What’s New in Solr 3.x/4.0

Charlottesville Lucene/Solr Meetup, August 15, 2011

Erik Hatcher, Lucid Imagination

What is Solr?

• Solr is the popular, blazing fast open source enterprise search platform from the Apache Lucene project. Its major features include powerful full-text search, hit highlighting, faceted search, dynamic clustering, database integration, rich document (e.g., Word, PDF) handling, and geospatial search. Solr is highly scalable, providing distributed search and index replication, and it powers the search and navigation features of many of the world's largest internet sites.

What is Lucene?

• Apache Lucene is a high-performance, full-featured text search engine library written entirely in Java. It is a technology suitable for nearly any application that requires full-text search, especially cross-platform.

Solr History

• November 2009: Solr 1.4 (Lucene 2.9.1)

• June 2010: Solr 1.4.1 (Lucene 2.9.3)

• 2011

• March - Solr 3.1

• May - Solr 3.2

• July - Solr 3.3

Solr 3.1

• Improved geospatial support

• Sorting by function queries

• Range faceting on all numeric fields

• Example Velocity driven search UI at http://localhost:8983/solr/browse

• A new termvector-based highlighter

• Improved spellchecking capabilities

• Improved integration with Apache Lucene

• New autosuggest component

• Distributed support for more components

• JSON document indexing and CSV response format

• Apache UIMA integration for metadata extraction

• Many other bugfixes, improvements and optimizations

Major components

• Apache Lucene 3.1.0

• Apache Tika 0.8

• Carrot2 3.4.2

• Velocity 1.6.1 and Velocity Tools 2.0-beta3

• Apache UIMA 2.3.1-SNAPSHOT

Schema / Config

• SOLR-1131: FieldTypes can now output multiple Fields per Type and still be searched. This can be handy for hiding the details of a particular implementation such as in the spatial case.

• SOLR-1379: Add RAMDirectoryFactory for non-persistent in memory index storage.

• SOLR-2059: Add "types" attribute to WordDelimiterFilterFactory, which allows you to customize how WordDelimiterFilter tokenizes text with a configuration file.

Indexing

• SOLR-945: JSON update handler that accepts add, delete, commit commands in JSON format.
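As a rough illustration of the JSON update format, the sketch below builds a message containing add, delete, and commit commands. The field names (id, title) and the helper name build_update_message are hypothetical, and the handler path the message is posted to depends on how solrconfig.xml is set up.

```python
import json

def build_update_message(doc, delete_id=None, commit=True):
    """Build a JSON update body with add, delete, and commit commands.

    Hypothetical helper for illustration; field names are assumptions.
    """
    message = {"add": {"doc": doc}}
    if delete_id is not None:
        message["delete"] = {"id": delete_id}
    if commit:
        message["commit"] = {}
    return json.dumps(message)

# One document to add, one stale document to delete, then commit.
body = build_update_message({"id": "doc1", "title": "Hello Solr"}, delete_id="doc0")
```

The resulting string would be sent as the body of an HTTP POST with a JSON content type.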

Geospatial

• SOLR-1302: Added several new distance based functions, including Great Circle (haversine), Manhattan, Euclidean and String (using the StringDistance methods in the Lucene spellchecker). Also added geohash(), deg() and rad() convenience functions. See http://wiki.apache.org/solr/FunctionQuery

• SOLR-1568: Added "native" filtering support for PointType, GeohashField. Added LatLonType with filtering support too. See http://wiki.apache.org/solr/SpatialSearch and the example. Refactored some items in Lucene spatial. Removed SpatialTileField as the underlying CartesianTier is broken beyond repair and is going to be moved.
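To make the distance measures named above concrete, here is a small standalone sketch of the great-circle (haversine), Manhattan, and Euclidean distances. It is not Solr's implementation; the function names and the mean Earth radius of 6371 km are assumptions chosen for illustration.

```python
import math

EARTH_RADIUS_KM = 6371.0  # assumed mean Earth radius

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two lat/lon points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    # Clamp to 1.0 to guard against floating-point overshoot before asin.
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(min(1.0, a)))

def manhattan(p, q):
    """Sum of absolute coordinate differences."""
    return sum(abs(a - b) for a, b in zip(p, q))

def euclidean(p, q):
    """Straight-line distance between two points."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))
```

For example, euclidean((1, 2), (3, 4)) gives the straight-line distance between the two vectors, while haversine_km follows the curvature of the sphere.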

Query Parsing

• SOLR-1553: New dismax parser implementation (accessible as "edismax") that supports full Lucene syntax, improved reserved char escaping, fielded queries, improved proximity boosting, and improved stopword handling. Note: status is experimental for now.

• SOLR-2015: Add a boolean attribute autoGeneratePhraseQueries to TextField. autoGeneratePhraseQueries="true" (the default) causes the query parser to generate phrase queries if multiple tokens are generated from a single non-quoted analysis string. For example, WordDelimiterFilter splitting text:pdp-11 will cause the parser to generate text:"pdp 11" rather than (text:PDP OR text:11). Note that autoGeneratePhraseQueries="true" tends not to work well for non-whitespace-delimited languages.

• SOLR-2128: Full parameter substitution for function queries. Example: q=add($v1,$v2)&v1=mul(popularity,5)&v2=20.0

• SOLR-2133: Function query parser can now parse multiple comma separated value sources. It also now fails if there is extra unexpected text after parsing the functions, instead of silently ignoring it. This allows expressions like q=dist(2,vector(1,2),$pt)&pt=3,4
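The parameter substitution example above can be sketched from the client side as follows. The block URL-encodes the request parameters and, purely for illustration, expands the $v1 and $v2 references textually; the expand helper is hypothetical and is not how Solr's parser works internally. Selecting the function parser with defType=func is an assumption about the request setup.

```python
import re
from urllib.parse import urlencode, parse_qs

params = {
    "defType": "func",          # assumption: select the function query parser
    "q": "add($v1,$v2)",
    "v1": "mul(popularity,5)",
    "v2": "20.0",
}

# Characters such as $, ( and , are percent-encoded for the URL.
query_string = urlencode(params)

def expand(expression, params):
    """Illustration only: textually replace each $name with its parameter value."""
    return re.sub(r"\$(\w+)", lambda m: params[m.group(1)], expression)

expanded = expand(params["q"], params)
```

Here expanded holds the function with both references filled in, which is the expression the substitution is meant to be equivalent to.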

Functions

• SOLR-1574: Add many new functions from java Math (e.g. sin, cos)

• SOLR-1569: Allow functions to take in literal strings by modifying the FunctionQParser and adding LiteralValueSource

• SOLR-1297: Add sort by Function capability

Analysis

• SOLR-1923: PhoneticFilterFactory now has support for the Caverphone algorithm.

• SOLR-1571: Added Unicode collation support through Lucene's CollationKeyFilter

• SOLR-1653: Add PatternReplaceCharFilter

• SOLR-1677: Add support for choosing the Lucene Version for Lucene components within Solr.

• SOLR-1984: Add HyphenationCompoundWordTokenFilterFactory.

• SOLR-2188: Added "maxTokenLength" argument to the factories for ClassicTokenizer, StandardTokenizer, and UAX29URLEmailTokenizer.

• ICU integration

Analysis (cont.)

• SOLR-1857: Synced Solr analysis with Lucene 3.1. Added KeywordMarkerFilterFactory and StemmerOverrideFilterFactory, which can be used to tune stemming algorithms.

• Added factories for Bulgarian, Czech, Hindi, Turkish, and Wikipedia analysis. Improved the performance of SnowballPorterFilterFactory.

• SOLR-1657: Converted remaining TokenStreams to the Attributes-based API. All Solr TokenFilters now support custom Attributes, and some have improved performance: especially WordDelimiterFilter and CommonGramsFilter.

• SOLR-1740: ShingleFilterFactory supports the "minShingleSize" and "tokenSeparator" parameters for controlling the minimum shingle size produced by the filter, and the separator string that it uses, respectively.

• SOLR-744: ShingleFilterFactory supports the "outputUnigramsIfNoShingles" parameter, to output unigrams if the number of input tokens is fewer than minShingleSize, and no shingles can be generated.

• SOLR-1974: Add LimitTokenCountFilterFactory.

• SOLR-1057: Add PathHierarchyTokenizerFactory.
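The shingle parameters mentioned above are easier to picture with a toy model. The following sketch mimics, in simplified form, what a shingle filter produces; it is not the Lucene implementation, and the Python parameter names are illustrative stand-ins for minShingleSize, maxShingleSize, tokenSeparator, outputUnigrams, and outputUnigramsIfNoShingles.

```python
def shingles(tokens, min_size=2, max_size=2, separator=" ",
             output_unigrams=True, output_unigrams_if_no_shingles=False):
    """Simplified model of shingle (token n-gram) generation."""
    out = []
    made_shingle = False
    for i in range(len(tokens)):
        if output_unigrams:
            out.append(tokens[i])
        # Emit every shingle of an allowed size that starts at position i.
        for size in range(min_size, max_size + 1):
            if i + size <= len(tokens):
                out.append(separator.join(tokens[i:i + size]))
                made_shingle = True
    # Fall back to unigrams only when nothing else could be produced.
    if not output_unigrams and not made_shingle and output_unigrams_if_no_shingles:
        out = list(tokens)
    return out

default_output = shingles(["please", "divide", "this"])
fallback_output = shingles(["solo"], output_unigrams=False,
                           output_unigrams_if_no_shingles=True)
```

With a single input token and unigrams disabled, no shingle can be formed, so the fallback flag decides whether the token survives at all.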

Faceting

• SOLR-1240: "Range Faceting" has been added. This is a generalization of the existing "Date Faceting" logic so that it now supports all stock numeric field types that support range queries, in addition to dates. facet.date is now deprecated in favor of this generalized mechanism.

• SOLR-397: Date Faceting now supports a "facet.date.include" param for specifying when the upper & lower end points of computed date ranges should be included in the range. Legal values are: "all", "lower", "upper", "edge", and "outer". For backwards compatibility the default value is the set: [lower,upper,edge], so that all ranges between start and end are inclusive of their endpoints, but the "before" and "after" ranges are not.

• SOLR-2325: Allow tagging and exclusion of main query for faceting.
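A range faceting request might be assembled as in the sketch below. The facet.range parameters are the ones described on the Solr wiki; the field name price, the helper names, and the match-all query are assumptions. The second helper only illustrates how start, end, and gap divide the interval into buckets when the end falls on a gap boundary.

```python
from urllib.parse import urlencode

def range_facet_params(field, start, end, gap):
    """Request parameters for faceting a numeric field into fixed-width ranges."""
    return {
        "q": "*:*",
        "facet": "true",
        "facet.range": field,
        "facet.range.start": start,
        "facet.range.end": end,
        "facet.range.gap": gap,
    }

def bucket_lower_bounds(start, end, gap):
    """Lower bound of each bucket, assuming end lies on a gap boundary."""
    bounds = []
    low = start
    while low < end:
        bounds.append(low)
        low += gap
    return bounds

query_string = urlencode(range_facet_params("price", 0, 500, 100))
buckets = bucket_lower_bounds(0, 500, 100)
```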

SolrJ

• SOLR-1139: Add TermsComponent Query and Response Support in SolrJ

• SOLR-1815: SolrJ now preserves the order of facet queries.

Solr Components

• SOLR-1316: Create autosuggest component

• SOLR-2010: Added ability to verify that spell checking collations have actual results in the index.

• SOLR-2157: Suggester should return alpha-sorted results when onlyMorePopular=false

• SOLR-1625: Add regexp support for TermsComponent

• SOLR-1556: TermVectorComponent now supports per-field overrides. It also now throws an error if passed-in fields do not exist, and issues warnings for fields whose term vector options (termVectors, offsets, positions) do not align with the schema declaration.

• SOLR-860: Add debug output for MoreLikeThis.

Highlighting

• SOLR-1268: Incorporate FastVectorHighlighter

• SOLR-2021: Add SolrEncoder plugin to Highlighter.

• SOLR-2030: Make FastVectorHighlighter use SolrEncoder.

• SOLR-2053: Add support for custom comparators in Solr spellchecker, per LUCENE-2479

• SOLR-2049: Add hl.multiValuedSeparatorChar for FastVectorHighlighter, per LUCENE-2603.

Distributed

• SOLR-785: Distributed Search support for SpellCheckComponent

• SOLR-1177: Distributed Search support for TermsComponent

Misc.

• SOLR-1957: The VelocityResponseWriter contrib moved to core. Example search UI now available at http://localhost:8983/solr/browse

• SOLR-1966: QueryElevationComponent can now return just the included results in the elevation file

• SOLR-1925: Add CSVResponseWriter (use wt=csv) that returns the list of documents in CSV format.

• SOLR-2263: Add ability for RawResponseWriter to stream binary files as well as text files.

• SOLR-1750: SolrInfoMBeanHandler added for simpler programmatic access to info currently available from registry.jsp and stats.jsp

• SOLR-2099: Add ability to throttle rsync based replication using rsync option --bwlimit.

UIMA

• UIMA - Unstructured Information Management Architecture - http://uima.apache.org/

• Enables UIMA components to augment documents

• Entity extraction, automated categorization, language detection, etc

• "contrib" plugin - SOLR-2129

• http://wiki.apache.org/solr/SolrUIMA

Optimizations

• SOLR-1679: Don't build up string messages in SolrCore.execute unless they are necessary for the current log level.

• SOLR-1874: Optimize PatternReplaceFilter for better performance.

• SOLR-1968: speed up initial filter cache population for facet.method=enum and also big terms for multi-valued facet.method=fc. The resulting speedup for the first facet request is anywhere from 30% to 32x, depending on how many terms are in the field and how many documents match per term.

• SOLR-2089: Speed up UnInvertedField faceting (facet.method=fc for multi-valued fields) when facet.limit is both high, and a high enough percentage of the number of unique terms in the field. Extreme cases yield speedups over 3x.

• SOLR-2046: add common functions to scripts-util.

Solr 3.2

• Ability to specify overwrite and commitWithin as request parameters when using the JSON update format

• TermQParserPlugin, useful when generating filter queries from terms returned from field faceting or the terms component.

• DebugComponent now supports using a NamedList to model Explanation objects in its responses instead of Explanation.toString

• Improvements to the UIMA and Carrot2 integrations

• Bugfixes and improvements from Apache Lucene 3.2
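The term query parser mentioned above is convenient because the value after the local params is taken as a single raw term, so a facet value can be turned into a filter without query-syntax escaping. A minimal sketch, with a hypothetical helper name and hypothetical field and value:

```python
from urllib.parse import urlencode

def term_filter(field, value):
    """Build a filter query that matches one exact indexed term in a field."""
    return "{!term f=%s}%s" % (field, value)

# A facet value containing characters that the standard query syntax treats specially.
fq = term_filter("category", "R&D (internal): drafts")
query_string = urlencode({"q": "*:*", "fq": fq})
```

Only URL encoding is applied to the value; no backslash escaping of the colon or parentheses is added.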

Other 3.2 goodies

• SOLR-2061: Pull base tests out into a new Solr Test Framework module, and publish binary, javadoc, and source test-framework jars.

• Dependency update: Carrot2 3.5.0

Solr 3.3

• Grouping / Field Collapsing

• A new, automaton-based suggest/autocomplete implementation offering an order of magnitude smaller RAM consumption.

• KStemFilterFactory, an optimized implementation of a less aggressive stemmer for English.

• Solr defaults to a new, more efficient merge policy (TieredMergePolicy). See http://s.apache.org/merging for more information.

• Important bugfixes, including extremely high RAM usage in spellchecking.

• Bugfixes and improvements from Apache Lucene 3.3

Solr 3.3 details

• SOLR-2378: A new, automaton-based implementation of the suggest (autocomplete) component, offering an order of magnitude smaller memory consumption compared to ternary trees and jaspell, and very fast lookups at runtime.

• SOLR-2400: Field- and DocumentAnalysisRequestHandler now provide a position history for each token, so you can follow the token through all analysis stages. The output contains a separate int[] attribute containing all positions from previous Tokenizers/TokenFilters (called "positionHistory").

• SOLR-2524: (SOLR-236, SOLR-237, SOLR-1773, SOLR-1311) Grouping / Field collapsing using the Lucene grouping contrib. The search result can be grouped by field and query.

• SOLR-1331: Added a srcCore parameter to CoreAdminHandler's mergeindexes action to merge one or more cores' indexes to a target core.

• SOLR-2610: Add an option to delete the index through the CoreAdmin UNLOAD action
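To illustrate what grouping by field means for a result list, the sketch below collapses an already ranked list of documents into groups and keeps the top few documents per group. It is an in-memory model, not Solr's code; the helper name, the output keys, and the manu field are assumptions. In a real request, grouping is switched on with parameters such as group=true and group.field.

```python
def group_by_field(docs, field, limit=1):
    """Group ranked documents by a field value, keeping the top `limit` per group.

    Groups appear in the order of their best-ranked document.
    """
    groups = {}
    for doc in docs:  # docs are assumed to be sorted by relevance already
        groups.setdefault(doc.get(field), []).append(doc)
    return [{"groupValue": value, "docs": members[:limit]}
            for value, members in groups.items()]

ranked = [
    {"id": "1", "manu": "apple"},
    {"id": "2", "manu": "dell"},
    {"id": "3", "manu": "apple"},
]
grouped = group_by_field(ranked, "manu", limit=1)
```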

Solr 4.0

• aka "trunk" at the moment

• major changes! (for the better!) at both Lucene and Solr levels

Lucene 4.0

• The postings APIs have been removed in favor of the new flexible indexing (flex) APIs.

• With flexible indexing it is now possible for an application to create its own postings codec, to alter how fields, terms, docs and positions are encoded into the index.

• String -> BytesRef

• Per-segment everything

4.0 details

• Directory.copy/Directory.copyTo now copies all files (not just index files), since what is and isn't an index file is now dependent on the codecs used.

• String to BytesRef

• FuzzyQuery and WildcardQuery now operate on Unicode code points, not Unicode code units.

• WildcardQuery and QueryParser now allow escaping with the '\' character.

• Similarity can now be configured on a per-field basis
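The difference between code points and code units matters for characters outside the Basic Multilingual Plane, which occupy two UTF-16 code units but are a single code point. The sketch below shows the two counts for one such term and uses Python's shell-style fnmatch as a stand-in for wildcard matching; it is not Lucene code, and the helper names are hypothetical.

```python
from fnmatch import fnmatchcase

def code_point_count(text):
    """Python strings are sequences of code points."""
    return len(text)

def utf16_code_unit_count(text):
    """Number of 16-bit units needed to encode the text as UTF-16."""
    return len(text.encode("utf-16-le")) // 2

# U+1D538 (a mathematical double-struck letter) lies outside the BMP.
term = "a\U0001D538b"

points = code_point_count(term)
units = utf16_code_unit_count(term)

# With code point semantics, the single-character wildcard covers the whole character.
matches = fnmatchcase(term, "a?b")
```

Under code unit semantics the same pattern would see four units and the single-character wildcard would cover only half of the supplementary character.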

Relevancy

• more flexible scoring

NRT

• per-segment

• IndexWriter#commit now doesn't block concurrent indexing while flushing all 'currently' RAM resident documents to disk.

More Lucene 4.0 features

• Added RegexpQuery support to QueryParser.

• Adds AutomatonQuery, a MultiTermQuery that matches terms against a finite-state machine. Implement WildcardQuery and FuzzyQuery with finite-state methods. Adds RegexpQuery.

• The QueryParser now accepts mixed inclusive and exclusive bounds for range queries. Example: "{3 TO 5]"
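The bracket notation reads as follows: a square bracket includes the endpoint and a curly brace excludes it. The sketch below parses such an expression and tests numeric membership. It is a simplified model with hypothetical helper names, and it compares values numerically, whereas a range query over plain text terms compares them as strings.

```python
import re

# Opening bracket, lower bound, "TO", upper bound, closing bracket.
RANGE_RE = re.compile(r"^([\[{])\s*(\S+)\s+TO\s+(\S+?)\s*([\]}])$")

def parse_range(expr):
    """Return (low, high, include_low, include_high) for a range expression."""
    match = RANGE_RE.match(expr.strip())
    if not match:
        raise ValueError("not a range expression: %r" % expr)
    open_bracket, low, high, close_bracket = match.groups()
    return float(low), float(high), open_bracket == "[", close_bracket == "]"

def in_range(value, expr):
    """Check whether a number falls inside the parsed range."""
    low, high, include_low, include_high = parse_range(expr)
    above = value >= low if include_low else value > low
    below = value <= high if include_high else value < high
    return above and below
```

So in_range(3, "{3 TO 5]") is false because the lower bound is exclusive, while in_range(5, "{3 TO 5]") is true because the upper bound is inclusive.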

Solr 4.0

• Pivot faceting

• Direct Solr spell checker

• Increased response writing flexibility (e.g. function query results)

• Distributed date/numeric range faceting

• "join" query parser

• NRT: You may now specify a 'soft' commit when committing. This will use Lucene's NRT feature to avoid guaranteeing documents are on stable storage in exchange for faster reopen times. There is also a new 'soft' autocommit tracker that can be configured.
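The "join" query parser can be pictured as a two-step lookup: find the documents that match an inner query, collect their values in a "from" field, then return the documents whose "to" field holds one of those values. The sketch below models that in memory. It is not Solr's implementation, and the field names, sample documents, and helper name are assumptions; the corresponding query would look roughly like {!join from=manu_id to=id}name:ipod.

```python
def join(docs, from_field, to_field, matches):
    """Return docs whose to_field equals the from_field of any matching doc."""
    keys = {doc[from_field] for doc in docs
            if from_field in doc and matches(doc)}
    return [doc for doc in docs if doc.get(to_field) in keys]

docs = [
    {"id": "m1", "name": "Apple"},                # manufacturer
    {"id": "m2", "name": "Dell"},                 # manufacturer
    {"id": "p1", "name": "ipod", "manu_id": "m1"},    # product
    {"id": "p2", "name": "laptop", "manu_id": "m2"},  # product
]

# Manufacturers of every product named "ipod".
manufacturers = join(docs, from_field="manu_id", to_field="id",
                     matches=lambda doc: doc.get("name") == "ipod")
```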

About Lucid...

• Lucid Imagination provides commercial-grade support, training, high-level consulting and value-added software for Lucene and Solr.

• We make Lucene ‘enterprise-ready’ by offering:

• Free, certified, distributions and downloads.

• Support, training, and consulting.

• LucidWorks Enterprise, a commercial search platform built on top of Solr.

• http://www.lucidimagination.com

Lucid Offerings

LucidFind

http://www.lucidimagination.com/search/?q=charlottesville
