![Page 1: Lucene Open Source Search Engine. Lucene - Overview Complete search engine in a Java library Stand-alone only, no server – But can use SOLR Handles indexing](https://reader030.vdocuments.site/reader030/viewer/2022033105/56649e565503460f94b4e2e2/html5/thumbnails/1.jpg)
Lucene
Open Source Search Engine
![Page 2: Lucene Open Source Search Engine. Lucene - Overview Complete search engine in a Java library Stand-alone only, no server – But can use SOLR Handles indexing](https://reader030.vdocuments.site/reader030/viewer/2022033105/56649e565503460f94b4e2e2/html5/thumbnails/2.jpg)
Lucene - Overview
• Complete search engine in a Java library
• Stand-alone only, no server
  – But can use SOLR
• Handles indexing and query
• Fully featured – but not 100% complete
• Customizable – to an extent
• Fully open source
• Current version: 3.6.1
![Page 3: Lucene Open Source Search Engine. Lucene - Overview Complete search engine in a Java library Stand-alone only, no server – But can use SOLR Handles indexing](https://reader030.vdocuments.site/reader030/viewer/2022033105/56649e565503460f94b4e2e2/html5/thumbnails/3.jpg)
Lucene Implementations
• LinkedIn
  – OS software on integer list compression
• Eclipse IDE
  – For searching documentation
• Jira
• Twitter
• Comcast
  – XfinityTV.com, some set top boxes
• Care.com
• MusicBrainz
• Apple, Disney
• BobDylan.com
![Page 4: Lucene Open Source Search Engine. Lucene - Overview Complete search engine in a Java library Stand-alone only, no server – But can use SOLR Handles indexing](https://reader030.vdocuments.site/reader030/viewer/2022033105/56649e565503460f94b4e2e2/html5/thumbnails/4.jpg)
Indexing
Lucene
![Page 5: Lucene Open Source Search Engine. Lucene - Overview Complete search engine in a Java library Stand-alone only, no server – But can use SOLR Handles indexing](https://reader030.vdocuments.site/reader030/viewer/2022033105/56649e565503460f94b4e2e2/html5/thumbnails/5.jpg)
Lucene - Indexing
• Directory = A reference to an Index
  – RAMDirectory, SimpleFSDirectory
• IndexWriter = Writes to the index, options:
  – Limited or unlimited field lengths
  – Auto commit
  – Analyzer (how to do text processing, more on this later)
  – Deletion Policy (only for deleting old temporary data)
• Document – Holds fields to index
• Field – A name/value pair + index/store flags
![Page 6: Lucene Open Source Search Engine. Lucene - Overview Complete search engine in a Java library Stand-alone only, no server – But can use SOLR Handles indexing](https://reader030.vdocuments.site/reader030/viewer/2022033105/56649e565503460f94b4e2e2/html5/thumbnails/6.jpg)
Lucene – Indexer Outline

    SimpleFSDirectory fsDir = new SimpleFSDirectory(File)
    IndexWriter iWriter = new IndexWriter(fsDir,…)
    Loop: fetch text for each document {
        Document doc = new Document();
        doc.add(new Field(…)); // for each field
        iWriter.addDocument(doc);
    }
    iWriter.commit();
    iWriter.close();
    fsDir.close();
![Page 7: Lucene Open Source Search Engine. Lucene - Overview Complete search engine in a Java library Stand-alone only, no server – But can use SOLR Handles indexing](https://reader030.vdocuments.site/reader030/viewer/2022033105/56649e565503460f94b4e2e2/html5/thumbnails/7.jpg)
Class Materials
• SharePoint link
  – use “search\[flast]” username
  – sharepoint.searchtechnologies.com
  – Annual Kickoff
  – Shared Documents
  – FY2013 Presentations
  – Introduction to Lucene
• lucene-training-src-FY2013.zip
![Page 8: Lucene Open Source Search Engine. Lucene - Overview Complete search engine in a Java library Stand-alone only, no server – But can use SOLR Handles indexing](https://reader030.vdocuments.site/reader030/viewer/2022033105/56649e565503460f94b4e2e2/html5/thumbnails/8.jpg)
Lucene – Index – Exercise 1
• Create a new Maven project
  – mvn archetype:generate -DgroupId=com.searchtechnologies -DartifactId=lucene-training -DarchetypeArtifactId=maven-archetype-quickstart -DinteractiveMode=false
  – Right click pom.xml, Maven.. Add Dependency
    • lucene-core in search box
    • Choose 3.6.1
    • Expand Maven Dependencies.. Right click lucene-core.. Maven download sources
  – Source code level = 1.6
• Copy Source File: LuceneIndexExercise.java
  – Into com.searchtechnologies package
• Copy data directory to your project
• Follow instructions in the file
![Page 9: Lucene Open Source Search Engine. Lucene - Overview Complete search engine in a Java library Stand-alone only, no server – But can use SOLR Handles indexing](https://reader030.vdocuments.site/reader030/viewer/2022033105/56649e565503460f94b4e2e2/html5/thumbnails/9.jpg)
Query
Lucene
![Page 10: Lucene Open Source Search Engine. Lucene - Overview Complete search engine in a Java library Stand-alone only, no server – But can use SOLR Handles indexing](https://reader030.vdocuments.site/reader030/viewer/2022033105/56649e565503460f94b4e2e2/html5/thumbnails/10.jpg)
Lucene - Query
• Directory = An index reference
• IndexReader = Reads the index, typically associated with reading document fields
  – readOnly
• IndexSearcher = Searches the Index
• QueryParser – Parses a string to a Query
  – QueryParser = Standard Lucene Parser
  – Constructor: Version, default field, analyzer
• Query – Query expression to execute
  – Returned by qParser.parse(String)
  – Search Tech’s QPL can generate Query objects
![Page 11: Lucene Open Source Search Engine. Lucene - Overview Complete search engine in a Java library Stand-alone only, no server – But can use SOLR Handles indexing](https://reader030.vdocuments.site/reader030/viewer/2022033105/56649e565503460f94b4e2e2/html5/thumbnails/11.jpg)
Lucene – Query part 2
• Executing a Search
  – TopDocs td = iSearcher.search(<query-object>, <num-docs>)
• TopDocs – Holds statistics on the search plus the top N documents
  – totalHits, scoreDocs[], maxScore
• ScoreDoc – Information on a single document
  – Doc ID and score
• Use IndexReader to fetch any Document from a Doc ID
  – (includes all stored fields for the document)
![Page 12: Lucene Open Source Search Engine. Lucene - Overview Complete search engine in a Java library Stand-alone only, no server – But can use SOLR Handles indexing](https://reader030.vdocuments.site/reader030/viewer/2022033105/56649e565503460f94b4e2e2/html5/thumbnails/12.jpg)
Lucene – Search Outline

    SimpleFSDirectory fsDir = new SimpleFSDirectory(File f)
    IndexReader iReader = IndexReader.open(fsDir,…)   // IndexReader is abstract, so use the static open() factory
    IndexSearcher iSearcher = new IndexSearcher(iReader)
    StandardAnalyzer sa = new StandardAnalyzer(…)
    QueryParser qParser = new QueryParser(…)
    Loop: fetch a query from the user {
        Query q = qParser.parse( <query string> )
        TopDocs tds = iSearcher.search(q, 10);
        Loop: For every document in tds.scoreDocs {
            Document doc = iReader.document(tds.scoreDocs[i].doc);
            Print: tds.scoreDocs[i].score, doc.get("field")
        }
    }
    // Close the StandardAnalyzer, iSearcher, and iReader
![Page 13: Lucene Open Source Search Engine. Lucene - Overview Complete search engine in a Java library Stand-alone only, no server – But can use SOLR Handles indexing](https://reader030.vdocuments.site/reader030/viewer/2022033105/56649e565503460f94b4e2e2/html5/thumbnails/13.jpg)
Lucene – Query – Exercise 2
• Open Source File: LuceneQueryExercise.java
• Follow instructions in the file
![Page 14: Lucene Open Source Search Engine. Lucene - Overview Complete search engine in a Java library Stand-alone only, no server – But can use SOLR Handles indexing](https://reader030.vdocuments.site/reader030/viewer/2022033105/56649e565503460f94b4e2e2/html5/thumbnails/14.jpg)
Relevancy Tuning
Lucene
![Page 15: Lucene Open Source Search Engine. Lucene - Overview Complete search engine in a Java library Stand-alone only, no server – But can use SOLR Handles indexing](https://reader030.vdocuments.site/reader030/viewer/2022033105/56649e565503460f94b4e2e2/html5/thumbnails/15.jpg)
Lucene Extras – Fun Things You Can Do
• iWriter.updateDocument(Term, Document)
  – Replaces any document which contains the “Term” (delete, then add)
  – “Term” in this case is a field/value pair
    • Such as “id” = “3768169”
• doc.setBoost( <float boost value> )
  – Multiplies term weights in the doc by boost value
  – Part of “fieldNorm” when you do an “explain”
• field.setBoost( <float boost value> )
  – Multiplies term weights in field by boost value
![Page 16: Lucene Open Source Search Engine. Lucene - Overview Complete search engine in a Java library Stand-alone only, no server – But can use SOLR Handles indexing](https://reader030.vdocuments.site/reader030/viewer/2022033105/56649e565503460f94b4e2e2/html5/thumbnails/16.jpg)
Explain - Example
iSearcher.explain(query, doc-number)

Query: star OR catch^0.6 for document 903

    1.2778056 = (MATCH) product of:
      2.5556111 = (MATCH) sum of:
        2.5556111 = (MATCH) weight(title:catch^0.6 in 903), product of:
          0.56637216 = queryWeight(title:catch^0.6), product of:
            0.6 = boost
            7.2195954 = idf(docFreq=1, maxDocs=1005)
            0.13074881 = queryNorm
          4.512247 = (MATCH) fieldWeight(title:catch in 903), product of:
            1.0 = tf(termFreq(title:catch)=1)
            7.2195954 = idf(docFreq=1, maxDocs=1005)
            0.625 = fieldNorm(field=title, doc=903)
      0.5 = coord(1/2)
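
The numbers in the explain output above can be checked by hand. The sketch below is plain Java arithmetic (no Lucene calls); the class and method names are made up for illustration. It multiplies the leaf values together in the same order as the explain tree: queryWeight = boost × idf × queryNorm, fieldWeight = tf × idf × fieldNorm, and the final score is coord × the sum of the matching clause weights.

```java
/** Recomputes the score shown in the explain() output using plain arithmetic. */
final class ExplainArithmetic {

    /** queryWeight = boost * idf * queryNorm */
    static double queryWeight(double boost, double idf, double queryNorm) {
        return boost * idf * queryNorm;
    }

    /** fieldWeight = tf * idf * fieldNorm */
    static double fieldWeight(double tf, double idf, double fieldNorm) {
        return tf * idf * fieldNorm;
    }

    /** score = coord * (sum of the weights of the clauses that matched) */
    static double score(double coord, double... clauseWeights) {
        double sum = 0.0;
        for (double w : clauseWeights) {
            sum += w;
        }
        return coord * sum;
    }

    public static void main(String[] args) {
        double idf = 7.2195954;                        // idf(docFreq=1, maxDocs=1005)
        double qw = queryWeight(0.6, idf, 0.13074881); // boost, idf, queryNorm
        double fw = fieldWeight(1.0, idf, 0.625);      // tf, idf, fieldNorm
        // Only "catch" matched, so one clause weight; coord(1/2) = 0.5
        double total = score(0.5, qw * fw);
        System.out.printf("queryWeight=%.6f fieldWeight=%.6f score=%.6f%n", qw, fw, total);
    }
}
```

Only one of the two query clauses matched document 903, which is why the sum has a single term and coord is 1/2.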
![Page 17: Lucene Open Source Search Engine. Lucene - Overview Complete search engine in a Java library Stand-alone only, no server – But can use SOLR Handles indexing](https://reader030.vdocuments.site/reader030/viewer/2022033105/56649e565503460f94b4e2e2/html5/thumbnails/17.jpg)
Lucene – Query – Exercise 3

• Add explain to your query program

    Explanation exp = iSearcher.explain( . . . )

• Call it for all documents produced by your search
• Simply use toString() on the result of explain() to display the results
![Page 18: Lucene Open Source Search Engine. Lucene - Overview Complete search engine in a Java library Stand-alone only, no server – But can use SOLR Handles indexing](https://reader030.vdocuments.site/reader030/viewer/2022033105/56649e565503460f94b4e2e2/html5/thumbnails/18.jpg)
Boosting – Other Issues
• Similarity Class Javadoc Documentation
  – Very useful discussion of boosting formulas
• Similarity.encodeNormValue() – 8-bit floating point!

    0.00 => 0     1.30 => 7D    2.60 => 81    3.90 => 83
    0.10 => 6E    1.40 => 7D    2.70 => 81    4.00 => 83
    0.20 => 72    1.50 => 7E    2.80 => 81    4.10 => 84
    0.30 => 74    1.60 => 7E    2.90 => 81    4.20 => 84
    0.40 => 76    1.70 => 7E    3.00 => 81    4.30 => 84
    0.50 => 78    1.80 => 7F    3.10 => 82    4.40 => 84
    0.60 => 78    1.90 => 7F    3.20 => 82    4.50 => 84
    0.70 => 79    2.00 => 80    3.30 => 82    4.60 => 84
    0.80 => 7A    2.10 => 80    3.40 => 82    4.70 => 84
    0.90 => 7B    2.20 => 80    3.50 => 82    4.80 => 84
    1.00 => 7C    2.30 => 80    3.60 => 83    4.90 => 84
    1.10 => 7C    2.40 => 80    3.70 => 83    5.00 => 84
    1.20 => 7C    2.50 => 80    3.80 => 83
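
To see why so many boost values collapse to the same byte, here is a small stand-alone sketch of an 8-bit float encoding with 3 mantissa bits and 5 exponent bits. It is written from memory of Lucene's SmallFloat.floatToByte315 and reimplemented in plain Java so it runs without Lucene; treat it as an illustration, not the library source. The class and method names are assumptions.

```java
/** Sketch of an 8-bit float: 3 mantissa bits, 5 exponent bits, zero-exponent point 15. */
final class NormEncodingSketch {

    /** Packs a float into one byte, discarding all but the top 3 mantissa bits. */
    static byte encode(float f) {
        int bits = Float.floatToRawIntBits(f);
        int small = bits >> (24 - 3);        // keep sign, exponent and top 3 mantissa bits
        int base = (63 - 15) << 3;           // smallest exponent this format can hold
        if (small <= base) {
            // zero or negative maps to 0; tiny positive values map to the smallest code
            return (bits <= 0) ? (byte) 0 : (byte) 1;
        }
        if (small >= base + 0x100) {
            return (byte) -1;                // too large: clamp to 0xFF
        }
        return (byte) (small - base);
    }

    /** Expands a byte produced by encode() back into a float. */
    static float decode(byte b) {
        if (b == 0) {
            return 0.0f;
        }
        int bits = (b & 0xff) << (24 - 3);
        bits += (63 - 15) << 24;
        return Float.intBitsToFloat(bits);
    }

    public static void main(String[] args) {
        float[] samples = {0.1f, 0.5f, 1.0f, 1.1f, 1.2f, 1.3f, 2.0f};
        for (float f : samples) {
            byte b = encode(f);
            System.out.println(String.format("%.2f => %02X (decodes to %s)", f, b & 0xff, decode(b)));
        }
    }
}
```

With this sketch 1.0, 1.1 and 1.2 all encode to 7C and decode back to 1.0, while 1.3 encodes to 7D and decodes to 1.25, which matches the corresponding rows of the table. Small boosts are therefore often lost entirely once they are folded into the norm.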
![Page 19: Lucene Open Source Search Engine. Lucene - Overview Complete search engine in a Java library Stand-alone only, no server – But can use SOLR Handles indexing](https://reader030.vdocuments.site/reader030/viewer/2022033105/56649e565503460f94b4e2e2/html5/thumbnails/19.jpg)
Lucene Query Objects
• Query objects are used to execute the search

Query String → QueryParser → Query objects → iSearcher.search() → TopDocs

All derived from the Lucene Query class
![Page 20: Lucene Open Source Search Engine. Lucene - Overview Complete search engine in a Java library Stand-alone only, no server – But can use SOLR Handles indexing](https://reader030.vdocuments.site/reader030/viewer/2022033105/56649e565503460f94b4e2e2/html5/thumbnails/20.jpg)
Lucene Query Objects - Example
(george AND washington) OR (thomas AND jefferson)
    BooleanQuery (clauses = SHOULD)
        BooleanQuery (clauses = MUST)
            TermQuery: george
            TermQuery: washington
        BooleanQuery (clauses = MUST)
            TermQuery: thomas
            TermQuery: jefferson
![Page 21: Lucene Open Source Search Engine. Lucene - Overview Complete search engine in a Java library Stand-alone only, no server – But can use SOLR Handles indexing](https://reader030.vdocuments.site/reader030/viewer/2022033105/56649e565503460f94b4e2e2/html5/thumbnails/21.jpg)
Lucene BooleanQuery

george +washington -martha jefferson -thomas +sally

WORKS LIKE AND:

    BooleanQuery bq = new BooleanQuery();
    bq.add( X , Occur.MUST);
    bq.add( Y , Occur.MUST);

WORKS LIKE OR:

    BooleanQuery bq = new BooleanQuery();
    bq.add( X , Occur.SHOULD);
    bq.add( Y , Occur.SHOULD);

WORKS LIKE: X AND (X OR Y) (X is required; Y is optional and only raises the score)

    BooleanQuery bq = new BooleanQuery();
    bq.add( X , Occur.MUST);
    bq.add( Y , Occur.SHOULD);
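
The matching rules for the three Occur values can be modelled in a few lines of plain Java. This is a toy model for reasoning about which documents match, not Lucene code: `Q`, `Bool`, `term` and `doc` are hypothetical names, a document is just a set of terms, and scoring is ignored. It assumes the default behaviour where SHOULD clauses are optional as soon as a MUST clause is present.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Toy model of BooleanQuery matching (MUST / SHOULD / MUST_NOT), ignoring scores. */
final class BooleanSemanticsSketch {

    enum Occur { MUST, SHOULD, MUST_NOT }

    /** Hypothetical stand-in for a query: answers whether a document's terms match. */
    interface Q {
        boolean matches(Set<String> docTerms);
    }

    /** Stand-in for TermQuery. */
    static Q term(final String t) {
        return docTerms -> docTerms.contains(t);
    }

    /** Stand-in for BooleanQuery. */
    static final class Bool implements Q {
        private final List<Q> must = new ArrayList<>();
        private final List<Q> should = new ArrayList<>();
        private final List<Q> mustNot = new ArrayList<>();

        Bool add(Q q, Occur occur) {
            switch (occur) {
                case MUST: must.add(q); break;
                case SHOULD: should.add(q); break;
                default: mustNot.add(q); break;
            }
            return this;
        }

        @Override
        public boolean matches(Set<String> docTerms) {
            for (Q q : mustNot) {
                if (q.matches(docTerms)) return false;   // any prohibited clause excludes the doc
            }
            for (Q q : must) {
                if (!q.matches(docTerms)) return false;  // every required clause must match
            }
            if (!must.isEmpty()) return true;            // SHOULD is optional once a MUST exists
            for (Q q : should) {
                if (q.matches(docTerms)) return true;    // otherwise at least one SHOULD must match
            }
            return false;                                // also covers MUST_NOT-only queries
        }
    }

    /** (george AND washington) OR (thomas AND jefferson), built without a parser. */
    static Q slideExample() {
        Q gw = new Bool().add(term("george"), Occur.MUST).add(term("washington"), Occur.MUST);
        Q tj = new Bool().add(term("thomas"), Occur.MUST).add(term("jefferson"), Occur.MUST);
        return new Bool().add(gw, Occur.SHOULD).add(tj, Occur.SHOULD);
    }

    static Set<String> doc(String... terms) {
        return new HashSet<>(Arrays.asList(terms));
    }

    public static void main(String[] args) {
        Q q = slideExample();
        System.out.println(q.matches(doc("george", "washington")));  // true
        System.out.println(q.matches(doc("george", "jefferson")));   // false
    }
}
```

The same nesting (two MUST groups under one SHOULD group) is the shape Exercise 4 asks for with real TermQuery and BooleanQuery objects.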
![Page 22: Lucene Open Source Search Engine. Lucene - Overview Complete search engine in a Java library Stand-alone only, no server – But can use SOLR Handles indexing](https://reader030.vdocuments.site/reader030/viewer/2022033105/56649e565503460f94b4e2e2/html5/thumbnails/22.jpg)
Lucene – Query – Exercise 4
• Create BooleanQuery and TermQuery objects as necessary to create a query without the query parser
• Goal: (star AND wars) OR (blazing AND saddles)
• TermQuery:

    tq = new TermQuery(new Term("field","token"))

• BooleanQuery:

    BooleanQuery bq = new BooleanQuery();
    bq.add( <nested query object> , Occur.MUST);
    bq.add( <nested query object> , Occur.MUST);

• Occur
  – Occur.MUST, Occur.SHOULD, Occur.MUST_NOT
• TermQuery and BooleanQuery derive from Query
  – Any “Query” objects can be passed to iSearcher.search()
![Page 23: Lucene Open Source Search Engine. Lucene - Overview Complete search engine in a Java library Stand-alone only, no server – But can use SOLR Handles indexing](https://reader030.vdocuments.site/reader030/viewer/2022033105/56649e565503460f94b4e2e2/html5/thumbnails/23.jpg)
Lucene Proximity Queries
• “Spanning” queries return matching “spans”

DOCUMENT: Four score and seven years ago, our forefathers brought forth…

Word positions mark word boundaries:
0 Four 1 score 2 and 3 seven 4 years 5 ago 6 our 7 forefathers 8 brought 9 forth 10

Query → Returns
• four before/5 seven → 0:4
• (four before/5 seven) before forefathers → 0:8
• brought near/3 ago → 5:9
• (four adj score) or (brought adj forth) → 0:2, 8:10
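
The start:end numbers above can be reproduced with a toy two-term matcher. This is a plain-Java sketch, not the Lucene span API: `SpanSketch`, `tokenize` and `near` are hypothetical names, and the window is assumed to mean "at most N other tokens between the two terms", which may not be exactly how the before/near operators count. The `inOrder` flag plays the role of the in-order flag that separates "before" from "near".

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

/** Toy two-term proximity matcher that reports spans as start:end (end exclusive). */
final class SpanSketch {

    /** Lower-cases and splits on non-letters; token i sits between boundaries i and i+1. */
    static String[] tokenize(String text) {
        return text.toLowerCase(Locale.ROOT).split("[^a-z]+");
    }

    /**
     * Finds spans covering one occurrence of each term.
     * maxGap is the assumed window: the number of other tokens allowed between the terms.
     * When inOrder is true, `first` must appear before `second`.
     */
    static List<String> near(String[] tokens, String first, String second, int maxGap, boolean inOrder) {
        List<String> spans = new ArrayList<>();
        for (int i = 0; i < tokens.length; i++) {
            if (!tokens[i].equals(first)) continue;
            for (int j = 0; j < tokens.length; j++) {
                if (i == j || !tokens[j].equals(second)) continue;
                if (inOrder && j < i) continue;
                int start = Math.min(i, j);
                int end = Math.max(i, j) + 1;          // boundary after the last matched token
                if (end - start - 2 <= maxGap) {
                    spans.add(start + ":" + end);
                }
            }
        }
        return spans;
    }

    public static void main(String[] args) {
        String[] doc = tokenize("Four score and seven years ago, our forefathers brought forth");
        System.out.println(near(doc, "four", "seven", 5, true));    // [0:4]
        System.out.println(near(doc, "brought", "ago", 3, false));  // [5:9]
        System.out.println(near(doc, "brought", "forth", 0, true)); // [8:10]
    }
}
```

Requiring order for "brought ... ago" returns no span, because "ago" comes first in the document.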
![Page 24: Lucene Open Source Search Engine. Lucene - Overview Complete search engine in a Java library Stand-alone only, no server – But can use SOLR Handles indexing](https://reader030.vdocuments.site/reader030/viewer/2022033105/56649e565503460f94b4e2e2/html5/thumbnails/24.jpg)
Proximity Queries : Available Operators
• (standard) SpanTermQuery
  – For terms inside spanning queries
• (standard) SpanNearQuery
  – Inorder flag handles both near and before
• (standard) SpanOrQuery
• (standard) SpanMultiTermQueryWrapper
  – fka SpanRegexQuery
• (Search Tech) SpanAndQuery
• (Search Tech) SpanBetweenQuery
  – between(start,end,positive-content,not-content)
![Page 25: Lucene Open Source Search Engine. Lucene - Overview Complete search engine in a Java library Stand-alone only, no server – But can use SOLR Handles indexing](https://reader030.vdocuments.site/reader030/viewer/2022033105/56649e565503460f94b4e2e2/html5/thumbnails/25.jpg)
Span Queries
• demo of LuceneSpanDemo.java
![Page 26: Lucene Open Source Search Engine. Lucene - Overview Complete search engine in a Java library Stand-alone only, no server – But can use SOLR Handles indexing](https://reader030.vdocuments.site/reader030/viewer/2022033105/56649e565503460f94b4e2e2/html5/thumbnails/26.jpg)
Analysis
Lucene
![Page 27: Lucene Open Source Search Engine. Lucene - Overview Complete search engine in a Java library Stand-alone only, no server – But can use SOLR Handles indexing](https://reader030.vdocuments.site/reader030/viewer/2022033105/56649e565503460f94b4e2e2/html5/thumbnails/27.jpg)
Analyzers
• “Analysis” = “Text Processing” in Lucene
• Includes:
  – Tokenization
    • Since 1955, the B-52… → Since, 1955, the, B, 52
  – Token filtering
    • Splitting, joining, replacing, filtering, etc.
    • Since, 1955, the, B, 52 → 1955, B, 52
    • George, Lincoln → george, lincoln
    • MasteringBiology → Mastering, Biology
    • B-52 → B52, B-52, B, 52
  – Stemming
    • tables → table
    • carried → carry
![Page 28: Lucene Open Source Search Engine. Lucene - Overview Complete search engine in a Java library Stand-alone only, no server – But can use SOLR Handles indexing](https://reader030.vdocuments.site/reader030/viewer/2022033105/56649e565503460f94b4e2e2/html5/thumbnails/28.jpg)
Analyzer, Tokenizer, TokenFilter

• Tokenizer: Text → TokenStream
• TokenFilter: TokenStream → TokenStream
• Analyzer: A complete text processing function (one tokenizer + multiple token filters)
  – Manufactures TokenStreams

Analyzer: string → Tokenizer → TokenFilter → TokenFilter → . . .
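
The "one tokenizer + multiple token filters" shape can be sketched with ordinary Java functions. This is a toy model, not the Lucene API: the real classes stream one token at a time through shared attribute objects, whereas this sketch passes whole lists for brevity. All names (`AnalysisChainSketch`, `stopFilter`, `lowerCaseFilter`, `analyze`) and the stop-word list are assumptions chosen to reproduce the "Since 1955, the B-52" example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.function.UnaryOperator;

/** Toy analyzer: one tokenizer followed by any number of token filters. */
final class AnalysisChainSketch {

    /** Tokenizer: text -> tokens, splitting on anything that is not a letter or digit. */
    static List<String> tokenize(String text) {
        List<String> out = new ArrayList<>();
        for (String t : text.split("[^A-Za-z0-9]+")) {
            if (!t.isEmpty()) out.add(t);
        }
        return out;
    }

    /** Token filter that drops stop words (compared case-insensitively). */
    static UnaryOperator<List<String>> stopFilter(final Set<String> stopWords) {
        return tokens -> {
            List<String> kept = new ArrayList<>();
            for (String t : tokens) {
                if (!stopWords.contains(t.toLowerCase(Locale.ROOT))) kept.add(t);
            }
            return kept;
        };
    }

    /** Token filter that lower-cases every token. */
    static UnaryOperator<List<String>> lowerCaseFilter() {
        return tokens -> {
            List<String> out = new ArrayList<>();
            for (String t : tokens) out.add(t.toLowerCase(Locale.ROOT));
            return out;
        };
    }

    /** Analyzer: runs the tokenizer, then each filter in the order given. */
    @SafeVarargs
    static List<String> analyze(String text, UnaryOperator<List<String>>... filters) {
        List<String> tokens = tokenize(text);
        for (UnaryOperator<List<String>> f : filters) {
            tokens = f.apply(tokens);
        }
        return tokens;
    }

    public static void main(String[] args) {
        Set<String> stop = new HashSet<>(Arrays.asList("since", "the"));
        // prints [1955, b, 52]
        System.out.println(analyze("Since 1955, the B-52", stopFilter(stop), lowerCaseFilter()));
    }
}
```

Changing the order or the set of filters changes the tokens that reach the index, which is why the same analysis has to be applied at query time.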
![Page 29: Lucene Open Source Search Engine. Lucene - Overview Complete search engine in a Java library Stand-alone only, no server – But can use SOLR Handles indexing](https://reader030.vdocuments.site/reader030/viewer/2022033105/56649e565503460f94b4e2e2/html5/thumbnails/29.jpg)
Existing Analyzers, Tokenizers, Filters
• Tokenizer
  – (Standard) CharTokenizer, WhitespaceTokenizer, KeywordTokenizer, ChineseTokenizer, CJKTokenizer, StandardTokenizer, WikipediaTokenizer (more)
  – (Search Tech) UscodeTokenizer (produces each HTML <tag> as a separate token)
• TokenFilter
  – Stemmers: (Standard) many language-specific stemmers, PorterStemFilter, SnowballFilter
  – Stemmers: (Search Tech) Lemmatizer
![Page 30: Lucene Open Source Search Engine. Lucene - Overview Complete search engine in a Java library Stand-alone only, no server – But can use SOLR Handles indexing](https://reader030.vdocuments.site/reader030/viewer/2022033105/56649e565503460f94b4e2e2/html5/thumbnails/30.jpg)
Existing Analyzers, Tokenizers, Filters
• TokenFilters (continued)
  – LengthFilter, LowerCaseFilter, StopFilter, SynonymTokenFilter (don’t use), WordDelimiterFilter (SOLR only)
• Analyzers
  – WhitespaceAnalyzer, StandardAnalyzer, various language analyzers, PatternAnalyzer

Analyzers almost always need to be customized.
![Page 31: Lucene Open Source Search Engine. Lucene - Overview Complete search engine in a Java library Stand-alone only, no server – But can use SOLR Handles indexing](https://reader030.vdocuments.site/reader030/viewer/2022033105/56649e565503460f94b4e2e2/html5/thumbnails/31.jpg)
Creating and Using TokenStream

    TokenStream tokenStream = new SomeTokenizer(…);
    tokenStream = new SomeTokenFilter1(tokenStream);
    tokenStream = new SomeTokenFilter2(tokenStream);

    CharTermAttribute charTermAtt = tokenStream.getAttribute(CharTermAttribute.class);
    OffsetAttribute offsetAtt = tokenStream.getAttribute(OffsetAttribute.class);

    while (tokenStream.incrementToken()) {
        // charTermAtt now contains info on the token's term
        // offsetAtt.startOffset() now contains the token's start offset
    }
![Page 32: Lucene Open Source Search Engine. Lucene - Overview Complete search engine in a Java library Stand-alone only, no server – But can use SOLR Handles indexing](https://reader030.vdocuments.site/reader030/viewer/2022033105/56649e565503460f94b4e2e2/html5/thumbnails/32.jpg)
Token Streams - How They Work
Caller → TokenFilter → TokenFilter → Tokenizer: each stage calls incrementToken() on the stage before it
• Tokenizer: gets the next token from the Reader, stores it in Attribute objects, and returns
• Each TokenFilter: modifies the attribute objects and returns
• Caller: uses the Attribute objects
![Page 33: Lucene Open Source Search Engine. Lucene - Overview Complete search engine in a Java library Stand-alone only, no server – But can use SOLR Handles indexing](https://reader030.vdocuments.site/reader030/viewer/2022033105/56649e565503460f94b4e2e2/html5/thumbnails/33.jpg)
Creating and Using TokenStream
DEMO
![Page 34: Lucene Open Source Search Engine. Lucene - Overview Complete search engine in a Java library Stand-alone only, no server – But can use SOLR Handles indexing](https://reader030.vdocuments.site/reader030/viewer/2022033105/56649e565503460f94b4e2e2/html5/thumbnails/34.jpg)
Replacement Pattern
• Caller calls incrementToken() on the TokenFilter
• The TokenFilter calls incrementToken() on its input
• It then modifies the attribute objects and returns

Token Filters Simply Modify Attributes that Pass Through
![Page 35: Lucene Open Source Search Engine. Lucene - Overview Complete search engine in a Java library Stand-alone only, no server – But can use SOLR Handles indexing](https://reader030.vdocuments.site/reader030/viewer/2022033105/56649e565503460f94b4e2e2/html5/thumbnails/35.jpg)
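Token Filter – Replacement Pattern

    import java.io.IOException;
    import org.apache.lucene.analysis.TokenFilter;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

    public final class LowerCaseFilter extends TokenFilter {
        private final CharTermAttribute termAtt;

        public LowerCaseFilter(TokenStream input) {
            super(input);
            termAtt = addAttribute(CharTermAttribute.class);
        }

        @Override
        public final boolean incrementToken() throws IOException {
            if (input.incrementToken()) {
                // Lower-case the term in place; the attribute is shared along the chain
                final char[] buffer = termAtt.buffer();
                final int length = termAtt.length();
                for (int i = 0; i < length; i++)
                    buffer[i] = Character.toLowerCase(buffer[i]);
                return true;
            } else
                return false;
        }
    }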
![Page 36: Lucene Open Source Search Engine. Lucene - Overview Complete search engine in a Java library Stand-alone only, no server – But can use SOLR Handles indexing](https://reader030.vdocuments.site/reader030/viewer/2022033105/56649e565503460f94b4e2e2/html5/thumbnails/36.jpg)
Deletion Pattern
• Caller calls incrementToken() on the TokenFilter
• The TokenFilter calls incrementToken() on its input
• Keep looping until a good token is found, then return it

Token Filters Check Token Attributes and May Call incrementToken() Multiple Times
![Page 37: Lucene Open Source Search Engine. Lucene - Overview Complete search engine in a Java library Stand-alone only, no server – But can use SOLR Handles indexing](https://reader030.vdocuments.site/reader030/viewer/2022033105/56649e565503460f94b4e2e2/html5/thumbnails/37.jpg)
Token Filter – Deletion Pattern

    import java.io.IOException;
    import org.apache.lucene.analysis.TokenFilter;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
    import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;

    public final class TokenLengthLessThan50CharsFilter extends TokenFilter {
        private final CharTermAttribute termAtt;
        private final PositionIncrementAttribute posIncrAtt;

        public TokenLengthLessThan50CharsFilter(TokenStream in) {
            super(in);
            termAtt = addAttribute(CharTermAttribute.class);
            posIncrAtt = addAttribute(PositionIncrementAttribute.class);
        }

        @Override
        public final boolean incrementToken() throws IOException {
            int skippedPositions = 0;
            while (input.incrementToken()) {
                final int length = termAtt.length();
                if (length > 50) {
                    // Drop the token but remember the positions it occupied
                    skippedPositions += posIncrAtt.getPositionIncrement();
                    continue;
                }
                // Add the skipped positions so later tokens keep their original positions
                posIncrAtt.setPositionIncrement(posIncrAtt.getPositionIncrement() + skippedPositions);
                return true;
            }
            return false;
        }
    }
![Page 38: Lucene Open Source Search Engine. Lucene - Overview Complete search engine in a Java library Stand-alone only, no server – But can use SOLR Handles indexing](https://reader030.vdocuments.site/reader030/viewer/2022033105/56649e565503460f94b4e2e2/html5/thumbnails/38.jpg)
Splitting Tokens Pattern – First Call
• Caller calls incrementToken() on the TokenFilter
• The TokenFilter calls incrementToken() on its input
• Split the token: return the first half, save the second half (the saved token)

When Splitting a Token, Save the Splits Aside For Later
![Page 39: Lucene Open Source Search Engine. Lucene - Overview Complete search engine in a Java library Stand-alone only, no server – But can use SOLR Handles indexing](https://reader030.vdocuments.site/reader030/viewer/2022033105/56649e565503460f94b4e2e2/html5/thumbnails/39.jpg)
Splitting Tokens Pattern – Second Call
• Caller calls incrementToken() on the TokenFilter
• The TokenFilter returns the saved token without calling its input

When Called the Second Time, Just Return Saved Token
![Page 40: Lucene Open Source Search Engine. Lucene - Overview Complete search engine in a Java library Stand-alone only, no server – But can use SOLR Handles indexing](https://reader030.vdocuments.site/reader030/viewer/2022033105/56649e565503460f94b4e2e2/html5/thumbnails/40.jpg)
Token Filter – Splitting Pattern

    import java.io.IOException;
    import org.apache.lucene.analysis.TokenFilter;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

    public final class SplitDashFilter extends TokenFilter {
        private final CharTermAttribute termAtt;
        // Buffer to hold the second half from the previous incrementToken() call
        private final char[] saveToken = new char[100];
        private int saveLen = 0;

        public SplitDashFilter(TokenStream in) {
            super(in);
            termAtt = addAttribute(CharTermAttribute.class);
        }

        @Override
        public final boolean incrementToken() throws IOException {
            if (saveLen > 0) { // Output previously saved token
                termAtt.setEmpty();
                termAtt.append(new String(saveToken, 0, saveLen));
                saveLen = 0;
                return true;
            }
            if (input.incrementToken()) { // Get a new token to split
                final char[] buffer = termAtt.buffer();
                final int length = termAtt.length();
                boolean foundDash = false;
                for (int i = 0; i < length; i++) { // Scan token looking for '-' to split it
                    if (!foundDash && buffer[i] == '-') { // Split at the first dash only
                        foundDash = true;
                        termAtt.setLength(i); // Set length so termAtt = first half now
                    } else if (foundDash && saveLen < saveToken.length)
                        saveToken[saveLen++] = buffer[i]; // Save second half for later
                }
                return true; // Output first half right away
            } else
                return false;
        }
    }
![Page 41: Lucene Open Source Search Engine. Lucene - Overview Complete search engine in a Java library Stand-alone only, no server – But can use SOLR Handles indexing](https://reader030.vdocuments.site/reader030/viewer/2022033105/56649e565503460f94b4e2e2/html5/thumbnails/41.jpg)
Token Splitting
DEMO
![Page 42: Lucene Open Source Search Engine. Lucene - Overview Complete search engine in a Java library Stand-alone only, no server – But can use SOLR Handles indexing](https://reader030.vdocuments.site/reader030/viewer/2022033105/56649e565503460f94b4e2e2/html5/thumbnails/42.jpg)
Stemmers and Lemmatizers
• Stemmers available in Lucene
  – Snowball, Porter
  – They are both terrible [much too aggressive]
  – For example: mining → min
• Kstem
  – Publicly available stemmer with Lucene TokenFilter implementation
  – Better, but still too aggressive:
    • searchmining → searchmine
• Search Technologies Lemmatizer
  – Based on GCIDE Dictionary
  – Extremely accurate, only reduces words to dictionary entries
  – Also does irregular spelling reduction: mice → mouse
  – STILL A WORK IN PROGRESS: Needs one more refactor
![Page 43: Lucene Open Source Search Engine. Lucene - Overview Complete search engine in a Java library Stand-alone only, no server – But can use SOLR Handles indexing](https://reader030.vdocuments.site/reader030/viewer/2022033105/56649e565503460f94b4e2e2/html5/thumbnails/43.jpg)
ST Query Processing
Lucene
![Page 44: Lucene Open Source Search Engine. Lucene - Overview Complete search engine in a Java library Stand-alone only, no server – But can use SOLR Handles indexing](https://reader030.vdocuments.site/reader030/viewer/2022033105/56649e565503460f94b4e2e2/html5/thumbnails/44.jpg)
Search Technologies Query Parser
• Originally written for GPO
  – Query → FAST FQL
• Converted to .Net for CPA
• Refactored for Lucene for Aspermont
• Refactored to be more componentized and pipeline-oriented for OLRC
• Still a work in progress
  – Lacks documentation, wiki, etc.
![Page 45: Lucene Open Source Search Engine. Lucene - Overview Complete search engine in a Java library Stand-alone only, no server – But can use SOLR Handles indexing](https://reader030.vdocuments.site/reader030/viewer/2022033105/56649e565503460f94b4e2e2/html5/thumbnails/45.jpg)
Search Technologies Query Processing
• Query Parser
  – Parses the user’s entered query
• Query Processing Pipeline
  – A sequence of query processing components which can be mixed and matched
• Lucene Query Builder
• Other Query Builders Possible
  – FAST, Google, etc.
  – No others implemented yet
• Query Configuration File
  – Holds query parsing and processing parameters
![Page 46: Lucene Open Source Search Engine. Lucene - Overview Complete search engine in a Java library Stand-alone only, no server – But can use SOLR Handles indexing](https://reader030.vdocuments.site/reader030/viewer/2022033105/56649e565503460f94b4e2e2/html5/thumbnails/46.jpg)
Our Own Query Processing: Why?
• Gives us more control
  – Can exactly meet user’s query syntax
• Exposes operators not available through Lucene Syntax
  – Example: before proximity operator
• “Behind the scenes” query tweaking
  – Field weighting
  – Token merging: rio tinto → url:riotinto
  – Exact case and exact suffix matching
  – True lemmatization (not just stemming)
![Page 47: Lucene Open Source Search Engine. Lucene - Overview Complete search engine in a Java library Stand-alone only, no server – But can use SOLR Handles indexing](https://reader030.vdocuments.site/reader030/viewer/2022033105/56649e565503460f94b4e2e2/html5/thumbnails/47.jpg)
ST Query Parser – Overall Structure
Query String → Parser → Processor → Processor → . . . → Lucene Builder → TopDocs
• The Parser and Processors pass generic “AQNode” structures
• The Lucene Builder produces Lucene Query structures
![Page 48: Lucene Open Source Search Engine. Lucene - Overview Complete search engine in a Java library Stand-alone only, no server – But can use SOLR Handles indexing](https://reader030.vdocuments.site/reader030/viewer/2022033105/56649e565503460f94b4e2e2/html5/thumbnails/48.jpg)
The Search Technologies Query Structure
• userQuery: Query String
• nodeQuery: Generic AQNode Structures
• finalQuery: Lucene Query Structures

• Holds references to all query representations
• Therefore, query processors can process any query representation
• Everything is a QueryProcessor
  – Parsing, processing, and query building
![Page 49: Lucene Open Source Search Engine. Lucene - Overview Complete search engine in a Java library Stand-alone only, no server – But can use SOLR Handles indexing](https://reader030.vdocuments.site/reader030/viewer/2022033105/56649e565503460f94b4e2e2/html5/thumbnails/49.jpg)
Query Parser: Features
• AND, OR, NOT, parentheses
  – ((star and wars) or (star and trek))
  – star and not born {broken}
• +, -
  – + = query boost
  – - = not {broken}
• Proximity operators
  – within/3, near/3, adj
• Phrases
• field: searches
  – title:(star and wars) and description:(the original)
![Page 50: Lucene Open Source Search Engine. Lucene - Overview Complete search engine in a Java library Stand-alone only, no server – But can use SOLR Handles indexing](https://reader030.vdocuments.site/reader030/viewer/2022033105/56649e565503460f94b4e2e2/html5/thumbnails/50.jpg)
Using Query Processors
• Load Query Configuration

    QueryConfig qConfig = new QueryConfig("data/query-config.xml");

• Create Query Processor

    IQueryProcessor iqp2 = new TokenizationQueryProcessor();

• Initialize Query Processor

    iqp2.initialize(qConfig);

• Use Query Processors (simply call in sequence)

    iqp1.process(query);
    iqp2.process(query);
    iqp3.process(query);
![Page 51: Lucene Open Source Search Engine. Lucene - Overview Complete search engine in a Java library Stand-alone only, no server – But can use SOLR Handles indexing](https://reader030.vdocuments.site/reader030/viewer/2022033105/56649e565503460f94b4e2e2/html5/thumbnails/51.jpg)
Query Processors: Other Notes
• Types of processors: (off the shelf)
  – Lemmatizer, tokenization, lower case
• QueryParser and Query classes may need to be fully qualified

    com.searchtechnologies.queryprocessor.Query query = new com.searchtechnologies.queryprocessor.Query(queryString);

• Query Parser Only Splits on Whitespace
  – star-trek or star-wars → or(star-trek,star-wars)
• Use TokenizationQueryProcessor to split fully
  – → or(phrase(star,trek),phrase(star,wars))
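
The effect of running processors in sequence can be shown with a self-contained toy. Nothing below is the Search Technologies API: `Node` is a hypothetical stand-in for AQNode, `Processor` for IQueryProcessor, and the two processors only imitate what tokenization and lower-casing are described as doing. It turns or(Star-Trek,Star-Wars) into or(phrase(star,trek),phrase(star,wars)).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

/** Toy query-processing pipeline over a small query tree. */
final class QueryPipelineSketch {

    /** Hypothetical stand-in for AQNode: an operator with operands, or a term with a value. */
    static final class Node {
        String op;                                   // "or", "phrase" or "term"
        String value;                                // set only on term nodes
        final List<Node> operands = new ArrayList<>();

        static Node term(String value) {
            Node n = new Node();
            n.op = "term";
            n.value = value;
            return n;
        }

        static Node of(String op, Node... kids) {
            Node n = new Node();
            n.op = op;
            n.operands.addAll(Arrays.asList(kids));
            return n;
        }

        @Override
        public String toString() {
            if (op.equals("term")) return value;
            StringBuilder sb = new StringBuilder(op).append('(');
            for (int i = 0; i < operands.size(); i++) {
                if (i > 0) sb.append(',');
                sb.append(operands.get(i));
            }
            return sb.append(')').toString();
        }
    }

    /** Hypothetical stand-in for IQueryProcessor: rewrites the tree in place. */
    interface Processor {
        void process(Node node);
    }

    /** Splits terms such as star-trek into phrase(star,trek). */
    static final class Tokenization implements Processor {
        @Override
        public void process(Node node) {
            if (node.op.equals("term")) {
                List<Node> parts = new ArrayList<>();
                for (String p : node.value.split("[^A-Za-z0-9]+")) {
                    if (!p.isEmpty()) parts.add(Node.term(p));
                }
                if (parts.size() > 1) {              // more than one piece: becomes a phrase
                    node.op = "phrase";
                    node.value = null;
                    node.operands.addAll(parts);
                }
                return;
            }
            for (Node child : node.operands) process(child);
        }
    }

    /** Lower-cases every term in the tree. */
    static final class Lowercase implements Processor {
        @Override
        public void process(Node node) {
            if (node.op.equals("term")) {
                node.value = node.value.toLowerCase(Locale.ROOT);
                return;
            }
            for (Node child : node.operands) process(child);
        }
    }

    /** Runs the processors in the order given, as in "simply call in sequence". */
    static Node run(Node query, Processor... pipeline) {
        for (Processor p : pipeline) p.process(query);
        return query;
    }

    public static void main(String[] args) {
        Node q = Node.of("or", Node.term("Star-Trek"), Node.term("Star-Wars"));
        // prints or(phrase(star,trek),phrase(star,wars))
        System.out.println(run(q, new Tokenization(), new Lowercase()));
    }
}
```

Because every processor takes and mutates the same tree, processors can be added, removed or reordered without touching the others.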
![Page 52: Lucene Open Source Search Engine. Lucene - Overview Complete search engine in a Java library Stand-alone only, no server – But can use SOLR Handles indexing](https://reader030.vdocuments.site/reader030/viewer/2022033105/56649e565503460f94b4e2e2/html5/thumbnails/52.jpg)
ST Query Processor – Exercise 5
• Add ST QueryProcessor to your Lucene Query
• Add Dependency to your pom.xml:
  – com.searchtechnologies: st-queryparser: 0.3
• Add Processors
  – com.searchtechnologies.queryprocessor.QueryParser
  – TokenizationQueryProcessor()
  – LowercaseQueryProcessor()
  – LuceneQueryBuilder()
• Initialize Config, Construct Processors, Initialize Processors, Execute Processors
![Page 53: Lucene Open Source Search Engine. Lucene - Overview Complete search engine in a Java library Stand-alone only, no server – But can use SOLR Handles indexing](https://reader030.vdocuments.site/reader030/viewer/2022033105/56649e565503460f94b4e2e2/html5/thumbnails/53.jpg)
Creating Your Own Query Processor
• AQNode – “Aspire Query Node”
  – Operands – list of operands (references to other AQNodes)
  – Operator – Enumerated list (AND, OR, NEAR…)
  – Proximity window (int)
  – From value, to value (objects)
    • Use from value for token strings
    • Use from + to value for date ranges, int ranges, etc.
  – startChar, endChar (in original user’s query string)
  – Enclosing field name
  – Other stuff for future expansion
    • Attached data objects
    • Custom Query Builder
![Page 54: Lucene Open Source Search Engine. Lucene - Overview Complete search engine in a Java library Stand-alone only, no server – But can use SOLR Handles indexing](https://reader030.vdocuments.site/reader030/viewer/2022033105/56649e565503460f94b4e2e2/html5/thumbnails/54.jpg)
Query Processor Outline

    public class MyQueryProcessor implements IQueryProcessor {

        @Override
        public void initialize(QueryConfig config) throws QueryProcessorException {
            // Read any parameters you need from the config
            // config is an AXML (a wrapper around a W3C DOM object)
        }

        @Override
        public void process(Query query) throws QueryProcessorException {
            // Process the query
            // query.getNodeQuery()  → The AQNode version of the query
            // query.getUserQuery()  → The original query string
            // query.getFinalQuery() → The final (typically Lucene) query structure
        }
    }
![Page 55: Lucene Open Source Search Engine. Lucene - Overview Complete search engine in a Java library Stand-alone only, no server – But can use SOLR Handles indexing](https://reader030.vdocuments.site/reader030/viewer/2022033105/56649e565503460f94b4e2e2/html5/thumbnails/55.jpg)
Query Processor Example

    public class LowercaseQueryProcessor implements IQueryProcessor {

        @Override
        public void initialize(QueryConfig config) throws QueryProcessorException {
        }

        @Override
        public void process(Query query) throws QueryProcessorException {
            convertToLowerCaseAQNodes(query.getNodeQuery());
        }

        void convertToLowerCaseAQNodes(AQNode aqn) {
            if (aqn.getOperator() == AQNode.OperatorEnum.TERM) {
                String termText = (String) aqn.getFromValue();
                aqn.setFromValue(termText.toLowerCase());
                return;
            }

            if (aqn.getOperands() == null)
                return;

            for (AQNode childAqn : aqn.getOperands()) {
                convertToLowerCaseAQNodes(childAqn);
            }
        }
    }
![Page 56: Lucene Open Source Search Engine. Lucene - Overview Complete search engine in a Java library Stand-alone only, no server – But can use SOLR Handles indexing](https://reader030.vdocuments.site/reader030/viewer/2022033105/56649e565503460f94b4e2e2/html5/thumbnails/56.jpg)
ST Query Processor – Exercise 6
• Copy the FixStarQueryProcessor
  – Looks for “sta” and changes them to “star”
• Fill out the contents of the QueryProcessor
• Add the QueryProcessor to your query program
• Run the program and query on “sta”
  – Add to STQueryProcessorExcercise5.java
![Page 57: Lucene Open Source Search Engine. Lucene - Overview Complete search engine in a Java library Stand-alone only, no server – But can use SOLR Handles indexing](https://reader030.vdocuments.site/reader030/viewer/2022033105/56649e565503460f94b4e2e2/html5/thumbnails/57.jpg)
Query Processing New Features
• Template Substitution (OLRC)
  – field:() searches are substituted for arbitrary query expressions
• Lemmatization (OLRC, BNA)
• Wildcard Handling (OLRC)
• Refactor Aspermont Query Processors
  – Semantic Network Expansion (ontology)
  – Add boost/reduce tokens (field:HI, field:LO)
  – Proximity boost
  – Composite fields and query field boost
![Page 58: Lucene Open Source Search Engine. Lucene - Overview Complete search engine in a Java library Stand-alone only, no server – But can use SOLR Handles indexing](https://reader030.vdocuments.site/reader030/viewer/2022033105/56649e565503460f94b4e2e2/html5/thumbnails/58.jpg)
Custom Hit Collector
• collect() method called for each matching doc
  – Should be fast
  – Throw exception to break out of loop
  – Relation to Scorer
• DecadesCollector
  – Custom collector to take the top scoring document from each decade
    • One main collector that wraps one TopScoreDocCollector per decade
    • See Source DecadesCollector.java
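
The idea behind a per-decade collector can be sketched without Lucene. The class below is not DecadesCollector.java and does not extend Lucene's Collector. It is a simplified plain-Java model that keeps only the single best hit per decade in a map, where the described design wraps one top-docs collector per decade. The `Hit` class and the `collect(docId, year, score)` signature are assumptions for illustration.

```java
import java.util.Map;
import java.util.TreeMap;

/** Toy model of a collector that keeps the top scoring document from each decade. */
final class DecadesCollectorSketch {

    /** One matching document with the year it belongs to and its score. */
    static final class Hit {
        final int docId;
        final int year;
        final float score;

        Hit(int docId, int year, float score) {
            this.docId = docId;
            this.year = year;
            this.score = score;
        }
    }

    // Decade (for example 1970) -> best hit seen so far, kept sorted by decade
    private final Map<Integer, Hit> bestPerDecade = new TreeMap<>();

    /** Called once per matching document, so it does only a lookup and a comparison. */
    void collect(int docId, int year, float score) {
        int decade = (year / 10) * 10;
        Hit best = bestPerDecade.get(decade);
        if (best == null || score > best.score) {
            bestPerDecade.put(decade, new Hit(docId, year, score));
        }
    }

    Map<Integer, Hit> results() {
        return bestPerDecade;
    }

    public static void main(String[] args) {
        DecadesCollectorSketch c = new DecadesCollectorSketch();
        c.collect(1, 1977, 2.0f);
        c.collect(2, 1979, 3.5f);
        c.collect(3, 1983, 1.0f);
        for (Map.Entry<Integer, Hit> e : c.results().entrySet()) {
            System.out.println(e.getKey() + "s: doc " + e.getValue().docId + " score " + e.getValue().score);
        }
    }
}
```

Keeping the per-document work this small matters because collect() runs for every match, not just for the top N.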
![Page 59: Lucene Open Source Search Engine. Lucene - Overview Complete search engine in a Java library Stand-alone only, no server – But can use SOLR Handles indexing](https://reader030.vdocuments.site/reader030/viewer/2022033105/56649e565503460f94b4e2e2/html5/thumbnails/59.jpg)
Complete
Open Source Search Engine