survey of open source full text search solutions

SURVEY OF OPEN SOURCE FULL TEXT SEARCH SOLUTIONS

Curtis Spencer

SEARCH WITHOUT FULL TEXT

• SQL “LIKE '%product%'”
– Easy to set up, but…
– SQL statements get too complex (giant OR and (?,?,?,?,?,?,?,?,?,?))
– Indexes on many columns become unwieldy and slow down inserts
– Limited to prefixes of bigger text columns
– Separation of Power: Join Index vs. Full Text
• Outsource to Google
– Hosted Solution
– Can only reach data that you actually render to HTML
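To make the "giant OR" point concrete, here is a minimal sketch of a LIKE-based search across several columns. It uses Python's built-in sqlite3 module as a stand-in database; the table name, column names, and the like_search helper are all hypothetical and not from the talk.

```python
import sqlite3

def like_search(conn, table, columns, term):
    """Search every listed column with LIKE '%term%' (hypothetical helper)."""
    # One "col LIKE ?" clause per column: the WHERE clause grows with the schema.
    where = " OR ".join(f"{col} LIKE ?" for col in columns)
    sql = f"SELECT id FROM {table} WHERE {where} ORDER BY id"
    params = [f"%{term}%"] * len(columns)
    return sql, [row[0] for row in conn.execute(sql, params)]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE items (id INTEGER PRIMARY KEY, name TEXT, summary TEXT, notes TEXT)"
)
conn.executemany(
    "INSERT INTO items VALUES (?, ?, ?, ?)",
    [
        (1, "Product manual", "Setup guide", ""),
        (2, "Invoice", "Billing for a product order", ""),
        (3, "Memo", "Lunch schedule", "nothing relevant"),
    ],
)

sql, ids = like_search(conn, "items", ["name", "summary", "notes"], "product")
print(sql)
print(ids)
```

A leading wildcard such as '%product%' generally cannot use an ordinary B-tree index, so each added column means another scan as well as another OR clause. The results also come back unranked, which is the gap full text search is meant to fill.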

FULL TEXT GOAL

• Return matches by relevance rather than pure equality value match
• Precision vs. Recall
– Precision: Are the results accurate?
– Recall: Did we get all the results we expected?
• Natural Language Search Queries such as “What is the fastest animal?”
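Precision and recall can be stated as simple ratios over sets of document ids. The sketch below uses the standard definitions; the function name and the sample ids are made up for illustration.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query.

    retrieved: ids the search engine returned
    relevant:  ids a human judged to be correct answers
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant
    # Precision: what fraction of the returned results are correct?
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    # Recall: what fraction of the correct results were returned?
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical query: 4 documents returned, 3 documents actually relevant.
p, r = precision_recall(retrieved=[1, 2, 3, 4], relevant=[2, 4, 5])
print(f"precision={p:.2f} recall={r:.2f}")
```

The two usually trade off against each other: returning more documents tends to raise recall and lower precision.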

FULL TEXT IMPLEMENTATION

• Inverted Index Data Structure
– Index of words to document’s location on disk
• Tokenization, Stopwords
– Internationalization Challenges
• Basic Query Languages
– Boolean match, relevance, proximity, etc.
– “World Series +Poker -Baseball”

Based on Apple’s Search Kit Impl
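The pieces above (tokenization, stopwords, an inverted index, and a +/- boolean query) fit together roughly as in the following in-memory sketch. It is not how Search Kit or any of the engines in this survey are implemented. The stopword list, the sample documents, and the scoring rule (count of optional terms matched) are all assumptions chosen to keep the example short.

```python
import re
from collections import defaultdict

# Assumed, tiny stopword list; real engines ship per-language lists.
STOPWORDS = {"a", "an", "and", "of", "the", "is", "in", "to"}

def tokenize(text):
    """Lowercase, split on non-alphanumerics, drop stopwords."""
    return [t for t in re.findall(r"[a-z0-9]+", text.lower()) if t not in STOPWORDS]

def build_index(docs):
    """Inverted index: term -> set of ids of the documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index

def search(index, query):
    """+term is required, -term is excluded, bare terms only affect ranking
    (or act as an OR filter when no +term is given)."""
    required, excluded, optional = [], [], []
    for raw in query.split():
        if raw.startswith("+"):
            required.extend(tokenize(raw))
        elif raw.startswith("-"):
            excluded.extend(tokenize(raw))
        else:
            optional.extend(tokenize(raw))

    if required:
        hits = set.intersection(*(index.get(t, set()) for t in required))
    elif optional:
        hits = set().union(*(index.get(t, set()) for t in optional))
    else:
        hits = set().union(*index.values())
    for term in excluded:
        hits = hits - index.get(term, set())

    # Crude relevance: documents matching more optional terms come first.
    def score(doc_id):
        return sum(doc_id in index.get(t, set()) for t in optional)

    return sorted(hits, key=lambda d: (-score(d), d))

docs = {
    1: "World Series of Poker final table",
    2: "Baseball World Series game seven",
    3: "Poker strategy for beginners",
    4: "World Series poker and baseball trivia",
}
index = build_index(docs)
print(search(index, "world series +poker -baseball"))
```

Here documents 2 and 4 are dropped for mentioning baseball, and document 1 ranks ahead of document 3 because it also matches "world" and "series". A production index would additionally store term positions (for proximity queries) and frequencies (for better relevance scoring).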

LANGUAGE STEMMING

• Reduce inflected words to their root
– Increase recall
– Decrease inverted index size
• Internationalization Challenges
– Language detection of the dataset to determine which stemming algorithm to use
– Complexity proportional to the level of morphology
• Porter Stemming Algorithm
– Examples: names -> name, departed -> depart, Mariners -> marin, Marin -> marin
• Snowball Project has a lot of different stemming implementations.
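To show the basic idea of suffix stripping, here is a deliberately naive stemmer. It is not the Porter algorithm, which applies several ordered rule phases with conditions on the remaining stem; the rule table and the minimum stem length of 3 below are assumptions for illustration only.

```python
# Toy rule table (assumed): (suffix, replacement), tried in order.
SUFFIX_RULES = [("sses", "ss"), ("ies", "i"), ("ed", ""), ("ing", ""), ("s", "")]

def toy_stem(word):
    """Strip the first matching suffix, keeping at least 3 characters of stem."""
    word = word.lower()
    for suffix, replacement in SUFFIX_RULES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)] + replacement
    return word

for w in ["names", "departed", "Marin", "Mariners"]:
    print(w, "->", toy_stem(w))

# Four inflected forms collapse to one index term, which is where the
# recall gain and the smaller inverted index come from.
forms = ["depart", "departed", "departs", "departing"]
print({toy_stem(w) for w in forms})
```

Note that this toy version stems "Mariners" only to "mariner", whereas Porter goes further to "marin" and so conflates it with "Marin". That kind of over-stemming is the precision cost paid for the recall gain.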

MYSQL FULL TEXT

• Pluses
– Integrated into MySQL
– Easy to use without learning a new library
• Minuses
– Indexes bigger than memory tend to be slow
– Scalability options are limited
– Can slow down insertions, deletions
– CJK support is lacking

SPHINX

• Pluses
– Very Fast
– Supports many data sources
– Retrieval can be integrated into MySQL
– Distributed Searching is a scaling option
• Minuses
– Configuration can be tricky
– Live index updates accomplished by delta indexing
– Internationalization (besides Russian) is left as an exercise for the reader

LUCENE/SOLR

• Pluses
– Java, so easy to integrate into client software as well as web
– Stable
– Distributed Searching
– Powerful Query Language
– Extensible API
– Good Internationalization Support
• Minuses
– Java
– Configuration is a pain

WHEN TO USE WHAT

Questions?

THANK YOU!
