full-text search with lucene yonik seeley [email protected] 02 may 2007 amsterdam, netherlands
TRANSCRIPT
What is Lucene
• High performance, scalable, full-text search library
• Written by Doug Cutting, 100% Java
• Focus: Indexing + Searching Documents
• Easily embeddable, no config files
• No crawlers or document parsing
Inverted Index
aardvark
hood
red
little
riding
robin
women
zoo
Little Red Riding Hood
Robin Hood
Little Women
0 1
0 2
0
0
2
1
0
1
2
Basic Application
IndexWriter IndexSearcher
Lucene Index
Documentfield1: value1field2: value2field3: value3
Hits(Matching Docs)Query
addDocument() search()
Indexing Documents
IndexWriter writer = new IndexWriter(directory, analyzer, true);
Document doc = new Document();doc.add(new Field("title", "Lucene in Action",
Field.Store.YES, Field.Index.TOKENIZED));doc.add(new Field("author", "Erik Hatcher",
Field.Store.YES, Field.Index.TOKENIZED));doc.add(new Field("author", "Otis Gospodnetic",
Field.Store.YES, Field.Index.TOKENIZED));writer.addDocument(doc);writer.close();
Field Options
• Indexed– Necessary for searching or sorting
• Tokenized– Text analysis done before indexing
• Stored
• Compressed
• Binary– Currently for stored-only fields
Searching an IndexIndexSearcher searcher = new
IndexSearcher(directory);QueryParser parser = new
QueryParser("defaultField", analyzer);Query query = parser.parse("title:Lucene");Hits hits = searcher.search(query);System.out.println(“matches:" + hits.length());Document doc = hits.doc(0);System.out.println(“first:" + doc.get("title"));searcher.close();
Scoring
• VSM – Vector Space Model• tf – numer of terms in field• lengthNorm – number of tokens in field• idf – number of documents containing term• coord – coordination factor, number of matching
terms• document boost• query clause boost
http://lucene.apache.org/java/docs/scoring.html
Query ConstructionLucene QueryParser • Example: queryParser.parse("title:spiderman");• good for IPC, human entered queries, debug• does text analysis and constructs appropriate
queries • not all query types supported
Programmatic query construction• Example: new TermQuery(new
Term(“title”,”spiderman”))• explicit, no escaping necessary
Query Examples
1. mission impossible• EQUIV: mission OR impossible• QueryParser default is “optional”
2. +mission +impossible –actor:cruise• EQUIV: mission AND impossible NOT cruise
3. “mission impossible” –actor:cruise
4. title:spiderman^10 description:spiderman
5. description:“spiderman movie”~10
Query Examples2
1. releaseDate:[2000 TO 2007]• Range search: lexicographic ordering, so
beware of numbers
2. Wildcard searches: te?t, te*t, test*3. spider~
• Fuzzy search: Levenshtein distance• Optional minimum similarity: spider~0.7
4. *:*5. (a AND b) OR (c AND d)
Deleting Documents
• IndexReader.deleteDocument(int id)– exclusive with IndexWriter– powerful
• Deleting with IndexWriter– deleteDocuments(Term t)– updateDocument(Term t, Document d)
• Deleting does not immediately reclaim space
Performance
• Decrease index segments
• Lower merge factor
• Optimize
• Use cached filters‘+title:spiderman +released:true’
‘title:spiderman’ filtered by ‘released:true’
Index Structure
_0.fnm_0.fdt_0.fdx_0.frq_0.tis_0.tii_0.prx_0.nrm
_0_1.del
segments_3IndexWriter params
• MaxBufferedDocs
• MergeFactor
• MaxMergeDocs
• MaxFieldLength
Search Relevancy
PowerShot SD 500
PowerShot SD 500
SD 500Power Shot
PowerShot
sd 500power shot
powershot
WhitespaceTokenizer
WordDelimiterFilter catenateWords=1
LowercaseFilter
power-shot sd500
power-shot sd500
sd 500power shot
sd 500power shot
WhitespaceTokenizer
WordDelimiterFilter catenateWords=0
LowercaseFilter
Query Analysis
A Match!
Document Analysis
Tokenizers
• Tokenizers break field text into tokens
• StandardTokenizer– source string: “full-text lucene.apache.org”– “full” “text” “lucene.apache.org”
• WhitespaceTokenizer– “full-text” “lucene.apache.org”
• LetterTokenizer– “full” “text” “lucene” “apache” “org”
TokenFilters• LowerCaseFilter• StopFilter• LengthFilter• ISOLatin1AccentFilter• SnowballPorterFilter
– stemming: reducing words to root form– rides, ride, riding => ride– country, countries => countri
• contrib/analyzers for other languages
Analyzersclass MyAnalyzer extends Analyzer { private Set myStopSet =
StopFilter.makeStopSet(StopAnalyzer.ENGLISH_STOP_WORDS);
public TokenStream tokenStream(String fieldname, Reader reader) {
TokenStream ts = new StandardTokenizer(reader); ts = new StandardFilter(ts); ts = new LowerCaseFilter(ts); ts = new StopFilter(ts, myStopSet); return ts; }}
Analysis Tips• Use PerFieldAnalyzerWrapper
• Add same field more than once, analyze differently– Boost exact case matches– Boost exact tense matches– Query with or without synonyms– Soundex for sounds-like queries
Nutch
• Open source web search application
• Crawlers
• Link-graph database
• Document parsers (HTML, word, pdf, etc)
• Language + charset detection
• Utilizes Hadoop (DFS + MapReduce) for massive scalability
Solr
• XML/HTTP, JSON APIs
• Faceted search / navigation
• Flexible Data Schema
• Hit Highlighting
• Configurable Caching
• Replication
• Web admin interface
• Solr Flare: Ruby on Rails user interface
Questions?