Lucene for Solr Developers


Upload: erik-hatcher

Post on 11-May-2015


DESCRIPTION

You’re Solr powered, and needing to customize its capabilities. Apache Solr is flexibly architected, with practically everything pluggable. Under the hood, Solr is driven by the well-known Apache Lucene. Lucene for Solr Developers will guide you through the various ways in which Solr can be extended, customized, and enhanced with a bit of Lucene API know-how. We’ll delve into improving analysis with custom character mapping, tokenizing, and token filtering extensions; show why and how to implement specialized query parsing, and how to add your own search and update request handling.

TRANSCRIPT

Page 1: Lucene for Solr Developers

Lucene for Solr Developers

erik . hatcher@


Page 2: Lucene for Solr Developers

Abstract

You’re Solr powered, and needing to customize its capabilities. Apache Solr is flexibly architected, with practically everything pluggable. Under the hood, Solr is driven by the well-known Apache Lucene. Lucene for Solr Developers will guide you through the various ways in which Solr can be extended, customized, and enhanced with a bit of Lucene API know-how. We’ll delve into improving analysis with custom character mapping, tokenizing, and token filtering extensions; show why and how to implement specialized query parsing, and how to add your own search and update request handling.


Page 3: Lucene for Solr Developers

About me...

• Co-author, “Lucene in Action”

• Committer, Lucene and Solr

• Lucene PMC and ASF member

• Member of Technical Staff / co-founder, Lucid Imagination


Page 4: Lucene for Solr Developers

... works

search platform

www.lucidimagination.com


Page 5: Lucene for Solr Developers

What is Lucene?

• An open source search library (not an application)

• 100% Java

• Continuously improved and tuned over more than 10 years

• Compact, portable index representation

• Programmable text analyzers, spell checking and highlighting

• Not a crawler or a text extraction tool


Page 6: Lucene for Solr Developers

Inverted Index

• Lucene stores input data in what is known as an inverted index

• In an inverted index each indexed term points to a list of documents that contain the term

• Similar to the index provided at the end of a book

• In this case "inverted" simply means that terms point to documents, rather than documents pointing to terms

• It is much faster to find a term in an index than to scan all the documents
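The idea of terms pointing to documents can be sketched in a few lines of plain Java. This is only a toy model, not Lucene's data structure: the class name InvertedIndexSketch and the crude lowercase-and-split "analysis" are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeMap;
import java.util.TreeSet;

// Toy inverted index: each term maps to the sorted set of document ids that contain it.
// Illustration only; Lucene's real postings are stored in a far more compact form.
final class InvertedIndexSketch {
    private final Map<String, SortedSet<Integer>> postings = new TreeMap<>();

    // Assumed "analysis": lowercase the text and split on anything that is not a letter or digit.
    void add(int docId, String text) {
        for (String term : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!term.isEmpty()) {
                postings.computeIfAbsent(term, t -> new TreeSet<>()).add(docId);
            }
        }
    }

    // A term lookup is a single map access; no document text is scanned.
    List<Integer> search(String term) {
        SortedSet<Integer> docs = postings.get(term.toLowerCase(Locale.ROOT));
        return docs == null ? new ArrayList<>() : new ArrayList<>(docs);
    }

    public static void main(String[] args) {
        InvertedIndexSketch index = new InvertedIndexSketch();
        index.add(1, "Lucene in Action");
        index.add(2, "Solr is driven by Lucene");
        index.add(3, "Action movies");
        System.out.println("lucene -> " + index.search("lucene"));
        System.out.println("action -> " + index.search("action"));
    }
}
```

Searching costs one lookup per query term regardless of how many documents were indexed, which is the point of the book-index analogy.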


Page 7: Lucene for Solr Developers

Inverted Index Example


Page 8: Lucene for Solr Developers

Segments and Merging

• A Lucene index is a collection of one or more sub-indexes called segments

• Each segment is a fully independent index

• A multi-way merge algorithm is used to periodically merge segments

• New segments are created when an IndexWriter flushes new documents and pending deletes to disk

• The goal is a balance between large-scale performance and small-scale updates

• Optimization merges all segments into one


Page 9: Lucene for Solr Developers

Segments and Merging


Page 10: Lucene for Solr Developers

Segments

• When a document is deleted it still exists in an index segment until that segment is merged

• At certain trigger points, newly added Documents buffered in memory are flushed to the Directory as a new segment

• Can be forced by calling commit

• Segments are periodically merged
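The lifecycle described above (buffer, flush to a new segment, mark deletes, merge) can be modeled in plain Java. This is a simplified sketch under stated assumptions: SegmentMergeSketch and its methods are hypothetical names, documents are reduced to id strings, and the merge here always combines every segment, which corresponds to an optimize rather than Lucene's incremental merge policy.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Toy model of segments: deletes only mark documents; merging physically drops them.
final class SegmentMergeSketch {
    // A segment's documents never change once written; only its deleted set grows.
    static final class Segment {
        final List<String> docIds;
        final Set<String> deleted = new HashSet<>();
        Segment(List<String> docIds) { this.docIds = new ArrayList<>(docIds); }
    }

    final List<Segment> segments = new ArrayList<>();
    private final List<String> buffer = new ArrayList<>();

    // New documents are buffered in memory first.
    void add(String docId) { buffer.add(docId); }

    // Flushing the buffer creates a new segment; commit forces that flush.
    void commit() {
        if (!buffer.isEmpty()) {
            segments.add(new Segment(buffer));
            buffer.clear();
        }
    }

    // Mark a committed document as deleted; it stays in its segment for now.
    void delete(String docId) {
        for (Segment s : segments) {
            if (s.docIds.contains(docId)) s.deleted.add(docId);
        }
    }

    // Merge all segments into one, copying only live documents.
    void mergeAll() {
        List<String> live = new ArrayList<>();
        for (Segment s : segments) {
            for (String id : s.docIds) {
                if (!s.deleted.contains(id)) live.add(id);
            }
        }
        segments.clear();
        segments.add(new Segment(live));
    }

    // Documents physically stored, including those only marked as deleted.
    int storedCount() {
        int n = 0;
        for (Segment s : segments) n += s.docIds.size();
        return n;
    }

    public static void main(String[] args) {
        SegmentMergeSketch index = new SegmentMergeSketch();
        index.add("a"); index.add("b"); index.commit();
        index.add("c"); index.commit();
        index.delete("b");
        System.out.println(index.segments.size() + " segments, " + index.storedCount() + " stored docs");
        index.mergeAll();
        System.out.println(index.segments.size() + " segments, " + index.storedCount() + " stored docs");
    }
}
```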


Page 11: Lucene for Solr Developers

IndexSearcher


Page 12: Lucene for Solr Developers

Adding new documents


Page 13: Lucene for Solr Developers

Commit


Page 14: Lucene for Solr Developers

Committed and Warmed


Page 15: Lucene for Solr Developers

Lucene Scoring

• Lucene uses a similarity scoring formula to rank results by measuring the similarity between a query and the documents that match the query. The factors that form the scoring formula are:

• Term Frequency: tf (t in d). How often the term occurs in the document.

• Inverse Document Frequency: idf (t). A measure of how rare the term is in the whole collection. Based on the inverse of the number of documents that contain the term (log-scaled in Lucene's default similarity).

• Terms that are rare throughout the entire collection score higher.


Page 16: Lucene for Solr Developers

Coord and Norms

• Coord: The coordination factor, coord (q, d). Boosts documents that match more of the search terms than other documents.

• If 4 of 4 terms match coord = 4/4

• If 3 of 4 terms match coord = 3/4

• Length Normalization - Adjusts the score based on the length of fields in the document.

• Shorter fields that match get a boost
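Both factors are simple arithmetic. The sketch below uses the formulas of Lucene's classic default similarity as best I know them (coord = overlap / maxOverlap, length norm = 1 / sqrt(number of terms)); the class name CoordAndNormSketch is an assumption, and the real norm is additionally multiplied by any index-time boost and stored with reduced precision.

```java
// Coordination factor and length normalization as plain arithmetic.
// Standalone illustration; it does not call into Lucene.
final class CoordAndNormSketch {
    // Fraction of the query terms that the document matched.
    static float coord(int overlap, int maxOverlap) {
        return overlap / (float) maxOverlap;
    }

    // Shorter fields get a larger factor: 1 / sqrt(number of terms in the field).
    static float lengthNorm(int numTerms) {
        return (float) (1.0 / Math.sqrt(numTerms));
    }

    public static void main(String[] args) {
        System.out.println("coord for 3 of 4 terms: " + coord(3, 4));
        System.out.println("norm for a 4-term field: " + lengthNorm(4));
        System.out.println("norm for a 100-term field: " + lengthNorm(100));
    }
}
```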


Page 17: Lucene for Solr Developers

Scoring Factors (cont)

• Boost: (t.field in d). A way to boost a field or a whole document above others.

• Query Norm: (q). Normalization value for a query, given the sum of the squared weights of each of the query terms.

• You will often hear the Lucene scoring simply referred to as TF·IDF.


Page 18: Lucene for Solr Developers

Explanation

• Lucene has a feature called Explanation

• Solr uses the debugQuery parameter to retrieve scoring explanations

0.2987913 = (MATCH) fieldWeight(text:lucen in 688), product of:
  1.4142135 = tf(termFreq(text:lucen)=2)
  9.014501 = idf(docFreq=3, maxDocs=12098)
  0.0234375 = fieldNorm(field=text, doc=688)
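The numbers in that explanation can be checked by hand. The sketch below assumes the classic default similarity formulas, tf = sqrt(termFreq) and idf = 1 + ln(maxDocs / (docFreq + 1)), and takes the fieldNorm value straight from the explanation; ExplainArithmeticSketch is a hypothetical name and nothing here calls Lucene.

```java
// Recomputes the factors of the fieldWeight explanation with plain arithmetic.
final class ExplainArithmeticSketch {
    // Term frequency factor: square root of the raw count.
    static float tf(float termFreq) {
        return (float) Math.sqrt(termFreq);
    }

    // Inverse document frequency: rarer terms (low docFreq) score higher.
    static float idf(int docFreq, int maxDocs) {
        return (float) (Math.log(maxDocs / (double) (docFreq + 1)) + 1.0);
    }

    // fieldWeight is the product of tf, idf and the stored field norm.
    static float fieldWeight(float termFreq, int docFreq, int maxDocs, float fieldNorm) {
        return tf(termFreq) * idf(docFreq, maxDocs) * fieldNorm;
    }

    public static void main(String[] args) {
        System.out.println("tf          = " + tf(2));
        System.out.println("idf         = " + idf(3, 12098));
        System.out.println("fieldWeight = " + fieldWeight(2, 3, 12098, 0.0234375f));
    }
}
```

Up to floating-point rounding, the three values should line up with the tf, idf and fieldWeight lines of the explanation.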


Page 19: Lucene for Solr Developers

Lucene Core

• IndexWriter

• Directory

• IndexReader, IndexSearcher

• analysis: Analyzer, TokenStream, Tokenizer, TokenFilter

• Query


Page 20: Lucene for Solr Developers

Solr Architecture


Page 21: Lucene for Solr Developers

Customizing - Don't do it!

• Unless you need to.

• In other words... ensure you've given the built-in capabilities a try, asked on the e-mail list, and spelunked into at least Solr's code a bit to make some sense of the situation.

• But we're here to roll up our sleeves, because we need to...


Page 22: Lucene for Solr Developers

But first...

• Look at Lucene and/or Solr source code as appropriate

• Carefully read javadocs and wiki pages - lots of tips there

• And, hey, search for what you're trying to do...

• Google, of course

• But try out LucidFind and other Lucene ecosystem specific search systems - http://www.lucidimagination.com/search/


Page 23: Lucene for Solr Developers

Extension points

• Tokenizer, TokenFilter, CharFilter

• SearchComponent

• RequestHandler

• ResponseWriter

• FieldType

• Similarity

• QParser

• DataImportHandler hooks

• data sources

• entity processors

• transformers

• several others


Page 24: Lucene for Solr Developers

Factories

• FooFactory (most) everywhere. Sometimes there's BarPlugin style

• for sake of discussion... let's just skip the "factory" part

• In Solr, Factories and Plugins are used by configuration loading to parameterize and construct the actual components
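The "parameterize, then construct" split can be illustrated without any Solr classes. In this sketch every name (PluginFactorySketch, FilterFactory, TruncateFilterFactory, the maxLength argument) is hypothetical; it only mimics the shape of config loading, where a class name and arguments come from configuration, the factory is instantiated reflectively, initialized once, and then asked for instances.

```java
import java.util.HashMap;
import java.util.Map;

// Plain-Java model of factory-based plugin loading; not Solr's actual loader.
final class PluginFactorySketch {
    // What the loader needs from any factory: accept arguments, then build.
    interface FilterFactory {
        void init(Map<String, String> args);
        Filter create();
    }

    interface Filter {
        String apply(String token);
    }

    // Hypothetical factory: truncates tokens to a configured maximum length.
    static final class TruncateFilterFactory implements FilterFactory {
        private int maxLength;

        public void init(Map<String, String> args) {
            String value = args.get("maxLength");
            if (value == null) {
                throw new IllegalArgumentException("maxLength must be specified");
            }
            maxLength = Integer.parseInt(value);
        }

        public Filter create() {
            final int max = maxLength;
            return new Filter() {
                public String apply(String token) {
                    return token.length() <= max ? token : token.substring(0, max);
                }
            };
        }
    }

    // Mimics config loading: instantiate by class name, pass the arguments, construct.
    static Filter load(String className, Map<String, String> args) {
        try {
            FilterFactory factory =
                (FilterFactory) Class.forName(className).getDeclaredConstructor().newInstance();
            factory.init(args);
            return factory.create();
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("Cannot load plugin " + className, e);
        }
    }

    public static void main(String[] args) {
        Map<String, String> config = new HashMap<>();
        config.put("maxLength", "4");
        Filter filter = load(TruncateFilterFactory.class.getName(), config);
        System.out.println(filter.apply("lucene"));
    }
}
```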


Page 25: Lucene for Solr Developers

"Installing" plugins

• Compile .java to .class, JAR it up

• Put JAR files in either:

• <solr-home>/lib

• a shared lib when using multicore

• anywhere, and register location in solrconfig.xml

• Hook in plugins as appropriate


Page 26: Lucene for Solr Developers

Multicore sharedLib

<solr sharedLib="/usr/local/solr/customlib" persistent="true">
  <cores adminPath="/admin/cores">
    <core instanceDir="core1" name="core1"/>
    <core instanceDir="core2" name="core2"/>
  </cores>
</solr>


Page 27: Lucene for Solr Developers

Plugins via solrconfig.xml

• <lib dir="/path/to/your/custom/jars" />


Page 28: Lucene for Solr Developers

Analysis

• CharFilter

• Tokenizer

• TokenFilter


Page 29: Lucene for Solr Developers

Primer

• Tokens, Terms

• Attributes: Type, Payloads, Offsets, Positions, Term Vectors

• part of the picture:


Page 30: Lucene for Solr Developers

Version

• enum:

• Version.LUCENE_31, Version.LUCENE_32, etc

• Version.onOrAfter(Version other)


Page 31: Lucene for Solr Developers

CharFilter

• extend BaseCharFilter

• enables pre-tokenization filtering/morphing of incoming field value

• only affects tokenization, not stored value

• Built-in CharFilters: HTMLStripCharFilter, PatternReplaceCharFilter, and MappingCharFilter


Page 32: Lucene for Solr Developers

Tokenizer

• common to extend CharTokenizer

• implement -

• protected abstract boolean isTokenChar(int c);

• optionally override -

• protected int normalize(int c)

• extend Tokenizer directly for finer control

• Popular built-in Tokenizers include: WhitespaceTokenizer, StandardTokenizer, PatternTokenizer, KeywordTokenizer, ICUTokenizer
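The isTokenChar/normalize contract can be modeled in plain Java to show how little a subclass has to supply. This is a standalone sketch, not Lucene's CharTokenizer: CharTokenizerSketch and LowerCaseAlnumTokenizer are hypothetical names, and the model returns a list of strings instead of producing a TokenStream with attributes and offsets.

```java
import java.util.ArrayList;
import java.util.List;

// Plain-Java model of the CharTokenizer contract: subclasses decide which code
// points belong to a token and may normalize each one as it is consumed.
abstract class CharTokenizerSketch {
    protected abstract boolean isTokenChar(int c);

    // Default is the identity; override to lowercase, fold accents, and so on.
    protected int normalize(int c) {
        return c;
    }

    // Walk the input by code point so supplementary characters stay intact.
    final List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < input.length(); ) {
            int cp = input.codePointAt(i);
            i += Character.charCount(cp);
            if (isTokenChar(cp)) {
                current.appendCodePoint(normalize(cp));
            } else if (current.length() > 0) {
                tokens.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }
}

// Hypothetical subclass: letters and digits form tokens, lowercased as they are read.
final class LowerCaseAlnumTokenizer extends CharTokenizerSketch {
    @Override
    protected boolean isTokenChar(int c) {
        return Character.isLetterOrDigit(c);
    }

    @Override
    protected int normalize(int c) {
        return Character.toLowerCase(c);
    }

    public static void main(String[] args) {
        System.out.println(new LowerCaseAlnumTokenizer().tokenize("Solr 3.x, powered-by Lucene!"));
    }
}
```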


Page 33: Lucene for Solr Developers

TokenFilter

• a TokenStream whose input is another TokenStream

• Popular TokenFilters include: LowerCaseFilter, CommonGramsFilter, SnowballFilter, StopFilter, WordDelimiterFilter
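"A TokenStream whose input is another TokenStream" is a decorator chain. The sketch below models that with java.util.Iterator standing in for TokenStream; TokenFilterChainSketch, the whitespace "tokenizer", and the tiny stop word list are all assumptions for illustration, and real TokenFilters work on attributes rather than strings.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Iterator;
import java.util.List;
import java.util.Locale;
import java.util.NoSuchElementException;
import java.util.Set;

// Plain-Java model of a filter chain: each filter wraps another token source.
final class TokenFilterChainSketch {
    // Filter that lowercases every token coming from its input.
    static Iterator<String> lowerCase(final Iterator<String> input) {
        return new Iterator<String>() {
            public boolean hasNext() { return input.hasNext(); }
            public String next() { return input.next().toLowerCase(Locale.ROOT); }
        };
    }

    // Filter that drops stop words by reading ahead to the next token it keeps.
    static Iterator<String> stop(final Iterator<String> input, final Set<String> stopWords) {
        return new Iterator<String>() {
            private String pending = advance();

            private String advance() {
                while (input.hasNext()) {
                    String token = input.next();
                    if (!stopWords.contains(token)) return token;
                }
                return null;
            }

            public boolean hasNext() { return pending != null; }

            public String next() {
                if (pending == null) throw new NoSuchElementException();
                String token = pending;
                pending = advance();
                return token;
            }
        };
    }

    // Assumed chain: whitespace split, then lowercase, then stop word removal.
    static List<String> analyze(String text) {
        Iterator<String> tokenizer = Arrays.asList(text.trim().split("\\s+")).iterator();
        Set<String> stopWords = new HashSet<>(Arrays.asList("the", "is", "by"));
        Iterator<String> chain = stop(lowerCase(tokenizer), stopWords);
        List<String> out = new ArrayList<>();
        while (chain.hasNext()) out.add(chain.next());
        return out;
    }

    public static void main(String[] args) {
        System.out.println(analyze("The Quick fox IS driven BY Lucene"));
    }
}
```

Order matters in such a chain: because lowercasing runs before the stop filter, "The" and "IS" are removed even though the stop list only holds lowercase words.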


Page 34: Lucene for Solr Developers

Lucene's analysis APIs

• tricky business, what with Attributes (Source/Factory's), State, characters, code points, Version, etc...

• Test!!!

• BaseTokenStreamTestCase

• Look at Lucene and Solr's test cases


Page 35: Lucene for Solr Developers

Solr's Analysis Tools

• Admin analysis tool

• Field analysis request handler

• DEMO


Page 36: Lucene for Solr Developers

Query Parsing

• String -> org.apache.lucene.search.Query


Page 37: Lucene for Solr Developers

QParserPlugin

public abstract class QParserPlugin implements NamedListInitializedPlugin {

  public abstract QParser createParser(String qstr, SolrParams localParams,
                                       SolrParams params, SolrQueryRequest req);
}


Page 38: Lucene for Solr Developers

QParser

public abstract class QParser {

  public abstract Query parse() throws ParseException;

}


Page 39: Lucene for Solr Developers

Built-in QParsers

from QParserPlugin.java

/** internal use - name to class mappings of builtin parsers */
public static final Object[] standardPlugins = {
  LuceneQParserPlugin.NAME, LuceneQParserPlugin.class,
  OldLuceneQParserPlugin.NAME, OldLuceneQParserPlugin.class,
  FunctionQParserPlugin.NAME, FunctionQParserPlugin.class,
  PrefixQParserPlugin.NAME, PrefixQParserPlugin.class,
  BoostQParserPlugin.NAME, BoostQParserPlugin.class,
  DisMaxQParserPlugin.NAME, DisMaxQParserPlugin.class,
  ExtendedDismaxQParserPlugin.NAME, ExtendedDismaxQParserPlugin.class,
  FieldQParserPlugin.NAME, FieldQParserPlugin.class,
  RawQParserPlugin.NAME, RawQParserPlugin.class,
  TermQParserPlugin.NAME, TermQParserPlugin.class,
  NestedQParserPlugin.NAME, NestedQParserPlugin.class,
  FunctionRangeQParserPlugin.NAME, FunctionRangeQParserPlugin.class,
  SpatialFilterQParserPlugin.NAME, SpatialFilterQParserPlugin.class,
  SpatialBoxQParserPlugin.NAME, SpatialBoxQParserPlugin.class,
  JoinQParserPlugin.NAME, JoinQParserPlugin.class,
};


Page 40: Lucene for Solr Developers

Local Parameters

• {!qparser_name param=value}expression

• or

• {!qparser_name param=value v=expression}

• Can substitute $references from request parameters
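To make the syntax concrete, here is a deliberately simplified local-params parser in plain Java. It is a sketch, not Solr's parser: LocalParamsSketch is a hypothetical class, only unquoted values are handled, and a $reference is resolved from a plain map standing in for the request parameters.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;

// Simplified parser for {!type key=value ...}expression with unquoted values only.
// Solr's real parser also supports quoting and escaping.
final class LocalParamsSketch {
    final String type;                 // parser name such as "term", or null if none given
    final Map<String, String> params;  // local params with $references already resolved
    final String expression;           // the text handed to the chosen parser

    private LocalParamsSketch(String type, Map<String, String> params, String expression) {
        this.type = type;
        this.params = params;
        this.expression = expression;
    }

    static LocalParamsSketch parse(String q, Map<String, String> requestParams) {
        Map<String, String> params = new LinkedHashMap<>();
        if (!q.startsWith("{!")) {
            return new LocalParamsSketch(null, params, q);
        }
        int end = q.indexOf('}');
        if (end < 0) {
            throw new IllegalArgumentException("Missing closing '}' in: " + q);
        }
        String type = null;
        for (String part : q.substring(2, end).trim().split("\\s+")) {
            int eq = part.indexOf('=');
            if (eq < 0) {
                if (!part.isEmpty()) type = part;  // a bare word names the parser
                continue;
            }
            String value = part.substring(eq + 1);
            if (value.startsWith("$")) {
                // $name is replaced by the request parameter of that name
                value = requestParams.get(value.substring(1));
            }
            params.put(part.substring(0, eq), value);
        }
        // v=... supplies the expression; otherwise it is whatever follows the brace
        String v = params.remove("v");
        return new LocalParamsSketch(type, params, v != null ? v : q.substring(end + 1));
    }

    public static void main(String[] args) {
        Map<String, String> request = Collections.singletonMap("id", "FOO37");
        LocalParamsSketch p = parse("{!term f=id v=$id}", request);
        System.out.println(p.type + " " + p.params + " " + p.expression);
    }
}
```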


Page 41: Lucene for Solr Developers

Param Substitution

solrconfig.xml

<requestHandler name="/document" class="solr.SearchHandler">
  <lst name="invariants">
    <str name="q">{!term f=id v=$id}</str>
  </lst>
</requestHandler>

Solr request

http://localhost:8983/solr/document?id=FOO37


Page 42: Lucene for Solr Developers

Custom QParser

• Implement a QParserPlugin that creates your custom QParser

• Register in solrconfig.xml

• <queryParser name="myparser" class="com.mycompany.MyQParserPlugin"/>


Page 43: Lucene for Solr Developers

Update Processor

• Responsible for handling these commands:

• add/update

• delete

• commit

• merge indexes


Page 44: Lucene for Solr Developers

Built-in Update Processors

• RunUpdateProcessor

• Actually performs the operations, such as adding the documents to the index

• LogUpdateProcessor

• Logs each operation

• SignatureUpdateProcessor

• duplicate detection and optionally rejection


Page 45: Lucene for Solr Developers

UIMA Update Processor

• UIMA - Unstructured Information Management Architecture - http://uima.apache.org/

• Enables UIMA components to augment documents

• Entity extraction, automated categorization, language detection, etc

• "contrib" plugin

• http://wiki.apache.org/solr/SolrUIMA


Page 46: Lucene for Solr Developers

Update Processor Chain

• UpdateProcessors are sequenced into a chain

• Each processor can abort the entire update or hand processing to the next processor in the chain

• Chains, of update processor factories, are specified in solrconfig.xml

• Update requests can specify an update.processor parameter
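The chain behaviour (do some work, then either delegate to the next processor or abort) can be sketched without Solr. All names below (UpdateChainSketch, RequireId, Log, Run) are hypothetical stand-ins, documents are plain maps, and aborting is modeled by throwing an exception before the rest of the chain runs.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Plain-Java model of an update processor chain; not Solr's classes.
final class UpdateChainSketch {
    // Each processor knows the next one; the default behaviour just delegates.
    static abstract class Processor {
        private final Processor next;

        Processor(Processor next) { this.next = next; }

        void processAdd(Map<String, Object> doc) {
            if (next != null) next.processAdd(doc);
        }
    }

    // Aborts the whole update by throwing when the document has no id.
    static final class RequireId extends Processor {
        RequireId(Processor next) { super(next); }

        @Override
        void processAdd(Map<String, Object> doc) {
            if (!doc.containsKey("id")) {
                throw new IllegalArgumentException("Document is missing an id");
            }
            super.processAdd(doc);
        }
    }

    // Records each operation, then hands the document on.
    static final class Log extends Processor {
        private final List<String> log;

        Log(List<String> log, Processor next) { super(next); this.log = log; }

        @Override
        void processAdd(Map<String, Object> doc) {
            log.add("add " + doc.get("id"));
            super.processAdd(doc);
        }
    }

    // Last link: actually "indexes" the document.
    static final class Run extends Processor {
        private final List<Map<String, Object>> index;

        Run(List<Map<String, Object>> index) { super(null); this.index = index; }

        @Override
        void processAdd(Map<String, Object> doc) {
            index.add(doc);
        }
    }

    public static void main(String[] args) {
        List<String> log = new ArrayList<>();
        List<Map<String, Object>> index = new ArrayList<>();
        Processor chain = new RequireId(new Log(log, new Run(index)));

        Map<String, Object> doc = new HashMap<>();
        doc.put("id", "1");
        chain.processAdd(doc);
        System.out.println(log + " / indexed: " + index.size());
    }
}
```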


Page 47: Lucene for Solr Developers

Default update processor chain

From SolrCore.java

// construct the default chain
UpdateRequestProcessorFactory[] factories = new UpdateRequestProcessorFactory[]{
  new RunUpdateProcessorFactory(),
  new LogUpdateProcessorFactory()
};

Note: these steps have been swapped on trunk recently


Page 48: Lucene for Solr Developers

Example Update Processor

• What are the best facets to show for a particular query? Wouldn't it be nice to see the distribution of document "attributes" represented across a result set?

• Learned this trick from the Smithsonian, who were doing it manually - add an indexed field containing the field names of the interesting other fields on the document.

• Facet on that field "of field names" initially, then request facets on the top values returned.


Page 49: Lucene for Solr Developers

Config for custom update processor

<updateRequestProcessorChain name="fields_used" default="true">
  <processor class="solr.processor.FieldsUsedUpdateProcessorFactory">
    <str name="fieldsUsedFieldName">attribute_fields</str>
    <str name="fieldNameRegex">.*_attribute</str>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory" />
  <processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>


Page 50: Lucene for Solr Developers

FieldsUsedUpdateProcessorFactory

public class FieldsUsedUpdateProcessorFactory extends UpdateRequestProcessorFactory {
  private String fieldsUsedFieldName;
  private Pattern fieldNamePattern;

  public UpdateRequestProcessor getInstance(SolrQueryRequest req, SolrQueryResponse rsp,
                                            UpdateRequestProcessor next) {
    return new FieldsUsedUpdateProcessor(req, rsp, this, next);
  }

  // ... next slide ...

}


Page 51: Lucene for Solr Developers

FieldsUsedUpdateProcessorFactory

@Override
public void init(NamedList args) {
  if (args == null) return;

  SolrParams params = SolrParams.toSolrParams(args);

  fieldsUsedFieldName = params.get("fieldsUsedFieldName");
  if (fieldsUsedFieldName == null) {
    throw new SolrException(SolrException.ErrorCode.SERVER_ERROR,
                            "fieldsUsedFieldName must be specified");
  }

  // TODO check that fieldsUsedFieldName is a valid field name and multiValued

  String fieldNameRegex = params.get("fieldNameRegex");
  if (fieldNameRegex == null) {
    throw new SolrException(SolrException.ErrorCode.SERVER_ERROR,
                            "fieldNameRegex must be specified");
  }
  fieldNamePattern = Pattern.compile(fieldNameRegex);

  super.init(args);
}


Page 52: Lucene for Solr Developers

// Presumably declared as an inner class of FieldsUsedUpdateProcessorFactory,
// since it reads the factory's fieldNamePattern and fieldsUsedFieldName fields
class FieldsUsedUpdateProcessor extends UpdateRequestProcessor {
  public FieldsUsedUpdateProcessor(SolrQueryRequest req, SolrQueryResponse rsp,
                                   FieldsUsedUpdateProcessorFactory factory,
                                   UpdateRequestProcessor next) {
    super(next);
  }

  @Override
  public void processAdd(AddUpdateCommand cmd) throws IOException {
    SolrInputDocument doc = cmd.getSolrInputDocument();

    Collection<String> incomingFieldNames = doc.getFieldNames();

    // Collect the names of incoming fields that match the configured pattern
    Iterator<String> iterator = incomingFieldNames.iterator();
    ArrayList<String> usedFields = new ArrayList<String>();
    while (iterator.hasNext()) {
      String f = iterator.next();
      if (fieldNamePattern.matcher(f).matches()) {
        usedFields.add(f);
      }
    }

    // Record the matching field names on the document, then continue down the chain
    doc.addField(fieldsUsedFieldName, usedFields.toArray());
    super.processAdd(cmd);
  }
}


Page 53: Lucene for Solr Developers

FieldsUsedUpdateProcessor in action

schema.xml

<dynamicField name="*_attribute" type="string" indexed="true" stored="true" multiValued="true"/>

Add some documents

solr.add([{:id=>1, :name => "Big Blue Shoes", :size_attribute => 'L', :color_attribute => 'Blue'},
          {:id=>2, :name => "Cool Gizmo", :memory_attribute => "16GB", :color_attribute => 'White'}])
solr.commit

Facet on attribute_fields - http://localhost:8983/solr/select?q=*:*&facet=on&facet.field=attribute_fields&wt=json&indent=on

"facet_fields":{
  "attribute_fields":[
    "color_attribute",2,
    "memory_attribute",1,
    "size_attribute",1]}


Page 54: Lucene for Solr Developers

Search Components

• Built-in: Clustering, Debug, Facet, Highlight, MoreLikeThis, Query, QueryElevation, SpellCheck, Stats, TermVector, Terms

• Non-distributed API:

• prepare(ResponseBuilder rb)

• process(ResponseBuilder rb)


Page 55: Lucene for Solr Developers

Example - auto facet select

• It sure would be nice if Solr could automatically select the field(s) to facet on, based dynamically on the profile of the results. For example, you're indexing disparate types of products, all with varying attributes (color and size for apparel, memory_size for electronics, subject for books, etc), and a user searches for "ipod", where most matching products have color and memory_size attributes... let's automatically facet on those fields.

• https://issues.apache.org/jira/browse/SOLR-2641


Page 56: Lucene for Solr Developers

AutoFacetSelectionComponent

• Too much code for a slide, let's take a look in an IDE...

• Basically -

• process() gets autofacet.field and autofacet.n request params, facets on field, takes top N values, sets those as facet.field's

• Gotcha - need to call rb.setNeedDocSet(true) in prepare() as faceting needs it
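The selection step itself (count the values of the "field of field names" across the result set, keep the top N) is easy to model outside Solr. The sketch below is not the component's code: AutoFacetSketch is a hypothetical class, the result set is passed in as lists of field names, and breaking ties by name order is an assumption chosen to agree with the sample output in this deck.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java model of automatic facet field selection.
final class AutoFacetSketch {
    // Count how often each attribute field name occurs across the matching
    // documents, then return the n most frequent names.
    static List<String> selectFacetFields(List<List<String>> attributeFieldsPerDoc, int n) {
        final Map<String, Integer> counts = new TreeMap<>();
        for (List<String> names : attributeFieldsPerDoc) {
            for (String name : names) {
                counts.merge(name, 1, Integer::sum);
            }
        }
        // TreeMap keys are already in name order; the stable sort below keeps
        // that order among names with equal counts (assumed tie-break).
        List<String> ordered = new ArrayList<>(counts.keySet());
        Collections.sort(ordered, new Comparator<String>() {
            public int compare(String a, String b) {
                return counts.get(b).compareTo(counts.get(a));
            }
        });
        int limit = Math.max(0, Math.min(n, ordered.size()));
        return new ArrayList<>(ordered.subList(0, limit));
    }

    public static void main(String[] args) {
        List<List<String>> docs = Arrays.asList(
            Arrays.asList("size_attribute", "color_attribute"),     // Big Blue Shoes
            Arrays.asList("color_attribute", "memory_attribute"));  // Cool Gizmo
        System.out.println(selectFacetFields(docs, 2));
    }
}
```

In the real component the selected names would then be set as facet.field parameters so the regular facet component does the counting of values.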


Page 57: Lucene for Solr Developers

SearchComponent config

<searchComponent name="autofacet" class="solr.AutoFacetSelectionComponent"/>
<requestHandler name="/searchplus" class="solr.SearchHandler">
  <arr name="components">
    <str>query</str>
    <str>autofacet</str>
    <str>facet</str>
    <str>debug</str>
  </arr>
</requestHandler>


Page 58: Lucene for Solr Developers

autofacet success

http://localhost:8983/solr/searchplus?q=*:*&facet=on&autofacet.field=attribute_fields&wt=json&indent=on

{
  "response":{"numFound":2,"start":0,"docs":[
    {
      "size_attribute":["L"],
      "color_attribute":["Blue"],
      "name":"Big Blue Shoes",
      "id":"1",
      "attribute_fields":["size_attribute", "color_attribute"]},
    {
      "color_attribute":["White"],
      "name":"Cool Gizmo",
      "memory_attribute":["16GB"],
      "id":"2",
      "attribute_fields":["color_attribute", "memory_attribute"]}]
  },
  "facet_counts":{
    "facet_queries":{},
    "facet_fields":{
      "color_attribute":[
        "Blue",1,
        "White",1],
      "memory_attribute":[
        "16GB",1]}}}


Page 59: Lucene for Solr Developers

Distributed-aware SearchComponents

• SearchComponent has a few distributed mode methods:

• distributedProcess(ResponseBuilder)

• modifyRequest(ResponseBuilder rb, SearchComponent who, ShardRequest sreq)

• handleResponses(ResponseBuilder rb, ShardRequest sreq)

• finishStage(ResponseBuilder rb)


Page 60: Lucene for Solr Developers

Testing

• AbstractSolrTestCase

• SolrTestCaseJ4

• SolrMeter

• http://code.google.com/p/solrmeter/


Page 61: Lucene for Solr Developers

For more information...

• http://www.lucidimagination.com

• LucidFind

• search Lucene ecosystem: mailing lists, wikis, JIRA, etc

• http://search.lucidimagination.com

• Getting started with LucidWorks Enterprise:

• http://www.lucidimagination.com/products/lucidworks-search-platform/enterprise

• http://lucene.apache.org/solr - wiki, e-mail lists


Page 62: Lucene for Solr Developers

Thank You!
