survey of open source full text search solutions
DESCRIPTION
By Curtis Spencer. Presented at http://web.meetup.com/34/calendar/8883832/ . An introduction to full text search and a comparison of the open source solutions MySQL full text, Sphinx, and Lucene/Solr, with guidance on when to use each.
TRANSCRIPT
SURVEY OF OPEN SOURCE FULL TEXT SEARCH SOLUTIONS
Curtis Spencer
SEARCH WITHOUT FULL TEXT
• SQL “LIKE '%product%'”
– Easy to set up, but…
– SQL statements get too complex (giant OR chains and (?,?,?,?,?,?,?,?,?,?) placeholder lists)
– Indexes on many columns become unwieldy and slow down inserts
– Limited to prefixes of bigger text columns
– Separation of Power: Join Index vs. Full Text
• Outsource to Google
– Hosted Solution
– Can only reach data that you actually render to HTML
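To make the complexity point concrete, here is a minimal sketch of LIKE-based search using Python's built-in sqlite3 module. The database engine, the `products` table, its columns, and the `like_search` helper are all assumptions made for illustration; none of them come from the talk.

```python
import sqlite3

def like_search(conn, terms, columns):
    """Build and run a query that matches any term in any column via LIKE."""
    clauses, params = [], []
    for column in columns:      # hypothetical, trusted column names
        for term in terms:
            clauses.append(f"{column} LIKE ?")
            params.append(f"%{term}%")
    # One OR clause per (column, term) pair: this is what grows unwieldy.
    sql = "SELECT id FROM products WHERE " + " OR ".join(clauses) + " ORDER BY id"
    return sql, [row[0] for row in conn.execute(sql, params)]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE products (id INTEGER PRIMARY KEY, name TEXT, description TEXT)")
conn.executemany(
    "INSERT INTO products (id, name, description) VALUES (?, ?, ?)",
    [
        (1, "Poker chips", "Clay chips for the World Series fan"),
        (2, "Baseball glove", "Leather glove"),
        (3, "Card table", "Folding table for poker night"),
    ],
)
sql, ids = like_search(conn, ["poker", "series"], ["name", "description"])
print(sql)
print(ids)
```

Two terms over two columns already produce four OR clauses, and a leading `%` wildcard generally prevents an ordinary B-tree index from being used, so each clause tends to scan the table.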
FULL TEXT GOAL
• Return matches by relevance rather than by pure equality of values
• Precision vs. Recall
– Precision: Are the results accurate?
– Recall: Did we get all the results we expected?
• Natural-language search queries such as “What is the fastest animal?”
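Precision and recall can be computed directly once you know which documents are relevant. The sketch below uses the standard set-based definitions; the function name and the sample document ids are made up for illustration.

```python
def precision_recall(returned, relevant):
    """Return (precision, recall) for a result set against a known relevant set."""
    returned, relevant = set(returned), set(relevant)
    hits = returned & relevant
    # Precision: what fraction of the returned results are relevant?
    precision = len(hits) / len(returned) if returned else 0.0
    # Recall: what fraction of the relevant documents were returned?
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall

# The engine returned four documents; three documents were actually relevant.
p, r = precision_recall(returned=[1, 2, 3, 4], relevant=[2, 4, 5])
print(p, r)
```

Here two of the four returned documents are relevant (precision 0.5) and two of the three relevant documents were found (recall about 0.67). Tuning a search engine usually trades one measure against the other.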
FULL TEXT IMPLEMENTATION
• Inverted Index Data Structure
– Index of words to each document’s location on disk
• Tokenization, Stopwords
– Internationalization Challenges
• Basic Query Languages
– Boolean match, relevance, proximity, etc.
– “World Series +Poker –Baseball”
• Based on Apple’s Search Kit implementation
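The pieces on this slide can be sketched in a few lines of Python: a tokenizer with a stopword list, an in-memory inverted index, and a small boolean query evaluator. This is an illustrative toy, not Search Kit or any particular engine. The stopword list, the function names, and the query semantics (`+term` required, `-term` excluded, bare terms only affect ranking when a required term is present) are assumptions. The query uses an ASCII `-` rather than the typographic dash shown on the slide.

```python
import re
from collections import defaultdict

STOPWORDS = {"a", "an", "and", "is", "of", "the", "to"}  # tiny illustrative list

def tokenize(text):
    """Lowercase, split on non-alphanumerics, and drop stopwords."""
    return [t for t in re.findall(r"[a-z0-9]+", text.lower()) if t not in STOPWORDS]

def build_index(docs):
    """Map each token to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in tokenize(text):
            index[token].add(doc_id)
    return index

def search(index, query):
    """Evaluate a query with +required, -excluded, and optional bare terms."""
    required, excluded, optional = [], [], []
    for word in query.split():
        if word.startswith("+"):
            required.extend(tokenize(word))
        elif word.startswith("-"):
            excluded.extend(tokenize(word))
        else:
            optional.extend(tokenize(word))
    if required:
        matches = set.intersection(*(index.get(t, set()) for t in required))
    else:
        matches = set().union(*(index.get(t, set()) for t in optional))
    for term in excluded:
        matches -= index.get(term, set())
    # Crude relevance: count how many optional terms each match contains.
    def score(doc_id):
        return sum(doc_id in index.get(t, set()) for t in optional)
    return sorted(matches, key=lambda d: (-score(d), d))

docs = {
    1: "Poker at the World Series",
    2: "World Series of Baseball and Poker",
    3: "Poker night",
    4: "World news",
}
index = build_index(docs)
results = search(index, "World Series +Poker -Baseball")
print(results)
```

Document 2 is dropped because it mentions baseball, document 4 lacks the required term, and document 1 ranks ahead of document 3 because it also matches both optional terms. A real engine would store positions for proximity queries and use a weighting scheme such as TF-IDF instead of a raw term count.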
LANGUAGE STEMMING
• Reduce inflected words to their root
– Increase recall
– Decrease inverted index size
• Internationalization Challenges
– Language detection of the dataset to determine which stemming algorithm to use
– Complexity proportional to the level of morphology
• Porter Stemming Algorithm
– Examples: names -> name, departed -> depart, Mariners -> marin, Marin -> marin
• The Snowball project provides stemming implementations for many languages.
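As a rough illustration of suffix stripping, the sketch below applies three hand-picked rules. It is deliberately not the Porter algorithm, which uses several ordered rule phases and a measure of stem length; the rule list and the `toy_stem` name are assumptions chosen only so the slide's examples can be followed.

```python
SUFFIXES = ("ers", "ed", "s")  # toy rules, checked in order; NOT the Porter rule set

def toy_stem(word, min_stem=3):
    """Strip the first matching suffix, keeping at least min_stem characters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)]
    return word

stems = {w: toy_stem(w) for w in ["names", "departed", "Mariners", "Marin"]}
print(stems)
```

"Mariners" and "Marin" collapse to the same stem, which shows the trade-off: stemming raises recall and shrinks the index, but unrelated words can start matching each other, which costs precision.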
MYSQL FULL TEXT
• Pluses
– Integrated into MySQL
– Easy to use without learning a new library
• Minuses
– Indexes bigger than memory tend to be slow
– Scalability options are limited
– Can slow down insertions, deletions
– CJK support is lacking
SPHINX
• Pluses
– Very Fast
– Supports many data sources
– Retrieval can be integrated into MySQL
– Distributed Searching is a scaling option
• Minuses
– Configuration can be tricky
– Live index updates accomplished by delta indexing
– Internationalization (besides Russian) is left as an exercise for the reader
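The idea behind delta indexing can be sketched without any Sphinx-specific API: keep a large main index that is rebuilt rarely, a small delta index that is rebuilt whenever new documents arrive, and query both. The class and method names below are hypothetical, and the sketch ignores updates and deletes of documents that are already in the main index.

```python
from collections import defaultdict

def build_index(docs):
    """Map each lowercase whitespace-separated token to the ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.lower().split():
            index[token].add(doc_id)
    return index

class MainPlusDelta:
    """Conceptual main+delta scheme; not Sphinx's actual interface."""

    def __init__(self, docs):
        self.main_docs = dict(docs)
        self.delta_docs = {}
        self.main = build_index(self.main_docs)   # expensive, rebuilt rarely
        self.delta = build_index({})              # cheap, rebuilt often

    def add(self, doc_id, text):
        """New documents only trigger a rebuild of the small delta index."""
        self.delta_docs[doc_id] = text
        self.delta = build_index(self.delta_docs)

    def search(self, term):
        """Query both indexes and combine the results."""
        term = term.lower()
        return sorted(self.main.get(term, set()) | self.delta.get(term, set()))

    def merge(self):
        """Periodically fold the delta into the main index and reset it."""
        self.main_docs.update(self.delta_docs)
        self.delta_docs = {}
        self.main = build_index(self.main_docs)
        self.delta = build_index({})

idx = MainPlusDelta({1: "poker chips", 2: "baseball glove"})
idx.add(3, "poker table")
before_merge = idx.search("poker")
idx.merge()
after_merge = idx.search("poker")
print(before_merge, after_merge)
```

New documents become searchable as soon as the delta is rebuilt, and the merge keeps the delta small. The price is the extra configuration and scheduling listed among the minuses.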
LUCENE/SOLR
• Pluses
– Java, so easy to integrate into client software as well as web
– Stable
– Distributed Searching
– Powerful Query Language
– Extensible API
– Good Internationalization Support
• Minuses
– Java
– Configuration is a pain
WHEN TO USE WHAT
Questions?
THANK YOU!