survey of open source full text search solutions

SURVEY OF OPEN SOURCE FULL TEXT SEARCH SOLUTIONS Curtis Spencer, [email protected]

Upload: lawebdevmeetup

Post on 10-Apr-2015


DESCRIPTION

By Curtis Spencer. Presented at http://web.meetup.com/34/calendar/8883832/ . An introduction to full text search, a comparison of the open source solutions MySQL full text, Sphinx, and Lucene/Solr, and guidance on when to use each.

TRANSCRIPT

Page 1: Survey of Open Source Full Text Search Solutions

SURVEY OF OPEN SOURCE FULL TEXT SEARCH SOLUTIONS

Curtis Spencer, [email protected]

Page 2: Survey of Open Source Full Text Search Solutions

SEARCH WITHOUT FULL TEXT

• SQL “LIKE '%product%'”
  – Easy to set up, but…
  – SQL statements get too complex (giant OR and (?,?,?,?,?,?,?,?,?,?))
  – Indexes on many columns become unwieldy and slow down inserts
  – Limited to prefixes of bigger text columns
  – Separation of Power: Join Index vs. Full Text
• Outsource to Google
  – Hosted Solution
  – Can only reach data that you actually render to HTML
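To make the "giant OR" problem concrete, here is a minimal sketch using Python's built-in sqlite3 module. The `products` table, its columns, and the `like_search` helper are hypothetical names chosen for illustration; the point is only that every extra term or column multiplies the number of LIKE clauses.

```python
import sqlite3

def like_search(conn, terms, columns):
    """Build and run a LIKE query matching any term in any column.

    Column names are interpolated directly, so they must come from trusted code.
    """
    # One "column LIKE ?" clause per (column, term) pair, joined with OR.
    clauses = [f"{col} LIKE ?" for col in columns for _ in terms]
    params = [f"%{term}%" for _ in columns for term in terms]
    sql = "SELECT id, title FROM products WHERE " + " OR ".join(clauses) + " ORDER BY id"
    return sql, conn.execute(sql, params).fetchall()

# Hypothetical in-memory table used only for this demonstration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE products (id INTEGER PRIMARY KEY, title TEXT, description TEXT)")
conn.executemany(
    "INSERT INTO products (title, description) VALUES (?, ?)",
    [
        ("Poker chips", "Clay chips for card games"),
        ("Baseball glove", "Leather glove"),
        ("Card table", "Folding table for poker night"),
    ],
)

sql, rows = like_search(conn, ["poker", "cards"], ["title", "description"])
print(sql)   # two terms across two columns already need four LIKE clauses
print(rows)  # substring matches only: no ranking, and "cards" does not match "card"
```

A leading `%` wildcard also prevents an ordinary B-tree index from being used for the match, which is one reason this approach does not scale well.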

Page 3: Survey of Open Source Full Text Search Solutions

FULL TEXT GOAL

• Return matches by relevance rather than by exact value equality
• Precision vs. Recall
  – Precision: Are the results accurate?
  – Recall: Did we get all the results we expected?
• Natural Language Search: queries such as “What is the fastest animal?”
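Precision and recall from this slide can be computed directly once you know which documents were returned and which were actually relevant. The sketch below uses the standard definitions; the function name and the sample document ids are made up for illustration.

```python
def precision_recall(returned, relevant):
    """Return (precision, recall) for one query.

    precision = relevant results returned / all results returned
    recall    = relevant results returned / all relevant documents
    """
    returned, relevant = set(returned), set(relevant)
    hits = returned & relevant
    precision = len(hits) / len(returned) if returned else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall

# The engine returned documents 1-4, but the relevant ones were 2, 4, 5, 6, 7.
p, r = precision_recall(returned=[1, 2, 3, 4], relevant=[2, 4, 5, 6, 7])
print(p, r)  # 0.5 0.4
```

The two measures usually pull against each other: returning more documents tends to raise recall and lower precision.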

Page 4: Survey of Open Source Full Text Search Solutions

FULL TEXT IMPLEMENTATION

• Inverted Index Data Structure
  – Maps each word to the locations on disk of the documents that contain it
• Tokenization, Stopwords
  – Internationalization Challenges
• Basic Query Languages
  – Boolean match, relevance, proximity, etc.
  – “World Series +Poker –Baseball”
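The pieces on this slide fit together roughly as follows. This is a toy in-memory sketch, not how any of the surveyed engines is implemented: the stopword list is a tiny illustrative one, document ids stand in for on-disk locations, and the query semantics (`+` required, `-` excluded, bare terms used only when nothing is required) are a simplifying assumption.

```python
import re
from collections import defaultdict

STOPWORDS = {"a", "an", "and", "in", "is", "of", "the", "to"}  # tiny illustrative list

def tokenize(text):
    """Lowercase, split on non-alphanumerics, drop stopwords."""
    return [t for t in re.findall(r"[a-z0-9]+", text.lower()) if t not in STOPWORDS]

def build_index(docs):
    """Map each token to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in tokenize(text):
            index[token].add(doc_id)
    return index

def boolean_search(index, query):
    """Evaluate '+term' (required), '-term' (excluded), bare term (optional)."""
    required, excluded, optional = [], [], []
    for raw in query.split():
        bucket = required if raw[0] == "+" else excluded if raw[0] == "-" else optional
        bucket.extend(tokenize(raw.lstrip("+-")))
    if required:
        # Every required term must appear in the document.
        result = set.intersection(*(index.get(t, set()) for t in required))
    else:
        # With no required terms, any optional term is enough.
        result = set().union(*(index.get(t, set()) for t in optional))
    for term in excluded:
        result -= index.get(term, set())
    return sorted(result)

docs = {
    1: "The World Series of Poker returns to Las Vegas",
    2: "Baseball fans await the World Series",
    3: "Poker and baseball are both played in the world series week",
    4: "Online poker tournaments",
}
index = build_index(docs)
print(boolean_search(index, "World Series +Poker -Baseball"))  # [1, 4]
```

A real engine would also use the optional terms ("World", "Series") to rank document 1 above document 4; this sketch only filters.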

Page 5: Survey of Open Source Full Text Search Solutions

Based on Apple’s Search Kit implementation

Page 6: Survey of Open Source Full Text Search Solutions

LANGUAGE STEMMING

• Reduce inflected words to their root
  – Increase recall
  – Decrease inverted index size
• Internationalization Challenges
  – Language detection of the dataset to determine which stemming algorithm to use
  – Complexity proportional to the level of morphology
• Porter Stemming Algorithm
  – Examples: names -> name, departed -> depart, Mariners -> marin, Marin -> marin
• The Snowball project provides stemming implementations for many languages.
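To illustrate why stemming shrinks the index and raises recall, here is a deliberately crude suffix stripper. It is not the Porter algorithm (which applies several ordered rule phases with conditions on the remaining stem); the suffix list and the minimum stem length of three are assumptions made for this sketch.

```python
SUFFIXES = ("ing", "ed", "s")  # checked in this order; far cruder than Porter's rules

def toy_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

vocabulary = ["names", "name", "departed", "departs", "departing", "depart"]
stems = {toy_stem(w) for w in vocabulary}
print(sorted(stems))  # ['depart', 'name']
```

Six surface forms collapse to two index entries, so a query for "departing" also finds documents that only contain "departed". The Mariners/Marin example on the slide shows the cost: unrelated words can collapse to the same stem, which hurts precision.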

Page 7: Survey of Open Source Full Text Search Solutions

MYSQL FULL TEXT

• Pluses
  – Integrated into MySQL
  – Easy to use without learning a new library
• Minuses
  – Indexes bigger than memory tend to be slow
  – Scalability options are limited
  – Can slow down insertions, deletions
  – CJK support is lacking
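As a rough idea of what "integrated into MySQL" looks like in practice, the sketch below builds the SQL for a FULLTEXT index and a `MATCH ... AGAINST` query. The `articles` table, its `title` and `body` columns, the index name, and the helper function are hypothetical; the `%s` placeholders assume a DB-API driver that uses that style. Nothing here connects to a database.

```python
# Hypothetical schema: articles(id, title, body).
CREATE_INDEX_SQL = "ALTER TABLE articles ADD FULLTEXT INDEX ft_title_body (title, body)"

def fulltext_query(search, boolean_mode=False):
    """Return (sql, params) for a relevance-ordered MySQL full text search."""
    mode = "IN BOOLEAN MODE" if boolean_mode else "IN NATURAL LANGUAGE MODE"
    match = f"MATCH(title, body) AGAINST (%s {mode})"
    sql = (
        f"SELECT id, title, {match} AS score "
        f"FROM articles WHERE {match} "
        "ORDER BY score DESC LIMIT 10"
    )
    # The search string is bound twice: once for the score, once for the filter.
    return sql, (search, search)

sql, params = fulltext_query("+Poker -Baseball", boolean_mode=True)
print(sql)
print(params)
```

With a real connection you would pass both values to the driver's `cursor.execute(sql, params)`. The column list in `MATCH(...)` has to match the columns of the FULLTEXT index.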

Page 8: Survey of Open Source Full Text Search Solutions

SPHINX

• Pluses
  – Very Fast
  – Supports many data sources
  – Retrieval can be integrated into MySQL
  – Distributed Searching is a scaling option
• Minuses
  – Configuration can be tricky
  – Live index updates accomplished by delta indexing
  – Internationalization (besides Russian) is left as an exercise for the reader
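The delta indexing idea mentioned under Minuses can be sketched in a few lines. This is a conceptual simulation in plain Python, not Sphinx's API or configuration: a large "main" index is rebuilt rarely, a small "delta" index covering only documents added since the last rebuild is rebuilt often, and searches consult both. The class and method names are invented, and the sketch assumes document ids only ever increase.

```python
from collections import defaultdict

def build(docs):
    """Build a word -> set of doc ids index by splitting on whitespace."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.lower().split():
            index[token].add(doc_id)
    return index

class MainPlusDelta:
    def __init__(self, docs):
        self.docs = dict(docs)
        self.merge()

    def merge(self):
        """Expensive, infrequent: rebuild main over everything, empty the delta."""
        self.checkpoint = max(self.docs, default=0)
        self.main = build(self.docs)
        self.delta = build({})

    def add(self, doc_id, text):
        """Cheap, frequent: store the doc and rebuild only the small delta."""
        self.docs[doc_id] = text
        recent = {i: t for i, t in self.docs.items() if i > self.checkpoint}
        self.delta = build(recent)

    def search(self, term):
        """Query both indexes and combine the results."""
        term = term.lower()
        return sorted(self.main.get(term, set()) | self.delta.get(term, set()))

idx = MainPlusDelta({1: "fast search", 2: "slow search"})
idx.add(3, "fast delta update")
print(idx.search("fast"))      # [1, 3]
print("delta" in idx.main)     # False: document 3 lives only in the delta index
idx.merge()
print("delta" in idx.main)     # True: the delta has been folded into main
```

The trade-off is operational: new documents become searchable quickly, but you have to schedule both the frequent delta rebuilds and the periodic merge.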

Page 9: Survey of Open Source Full Text Search Solutions

LUCENE/SOLR

• Pluses
  – Java, so easy to integrate into client software as well as web
  – Stable
  – Distributed Searching
  – Powerful Query Language
  – Extensible API
  – Good Internationalization Support
• Minuses
  – Java
  – Configuration is a pain
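Solr exposes Lucene over HTTP, so a query is just a URL. The sketch below only builds such a URL with Python's standard library; it does not contact a server. The host, port, core name `articles`, and the `title` field are assumptions, and the exact path depends on how a given Solr installation is laid out.

```python
from urllib.parse import urlencode

def solr_select_url(base_url, core, query, rows=10):
    """Build a URL for Solr's select handler, asking for a JSON response."""
    params = {"q": query, "rows": rows, "wt": "json"}
    return f"{base_url}/solr/{core}/select?{urlencode(params)}"

# Lucene query syntax: field:term combined with boolean operators.
url = solr_select_url(
    "http://localhost:8983", "articles", "title:poker AND NOT title:baseball"
)
print(url)
```

The query string is URL-encoded by `urlencode`, so spaces and colons in the Lucene query are escaped safely.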

Page 10: Survey of Open Source Full Text Search Solutions

WHEN TO USE WHAT

Page 11: Survey of Open Source Full Text Search Solutions

Questions?

Page 12: Survey of Open Source Full Text Search Solutions

THANK YOU!