PostgreSQL Full-text search demystified
PgConf EU 2014
Javier Ramirez
@supercoco9
https://teowaki.com
The problem
our architecture
One does not simply
SELECT * from stuff where
content ilike '%postgresql%'
Basic search features
* stemmers (run, runner, running)
* unaccented (josé, jose)
* results highlighting
* rank results by relevance
Nice to have features
* partial searches
* search operators (OR, AND...)
* synonyms (postgres, postgresql, pgsql)
* thesaurus (OS=Operating System)
* fast, and space-efficient
* debugging
Good News:
PostgreSQL supports all
the requested features
Bad News:
unless you already know about search
engines, the official docs are not obvious
How a search engine works
* An indexing phase
* A search phase
The indexing phase
Convert the input text to tokens
The search phase
Match the search terms to
the indexed tokens
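In PostgreSQL terms, both phases can be sketched with the built-in english configuration (an assumption here; the deck builds a custom configuration later):

```sql
-- Indexing phase: convert the input text to normalized tokens
SELECT to_tsvector('english', 'The runner was running two runs');

-- Search phase: the search term is normalized the same way,
-- then matched against the indexed tokens
SELECT to_tsvector('english', 'The runner was running two runs')
       @@ to_tsquery('english', 'run');
-- true: 'running' and 'runs' both stem to 'run'
```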
indexing in depth
* choose an index format
* tokenize the words
* apply token analysis/filters
* discard unwanted tokens
the index format
* R-tree-style indexes (GiST in PostgreSQL)
* inverted indexes (GIN in PostgreSQL)
* dynamic/distributed indexes
dynamic indexes: segmentation
* sometimes the token index is
segmented to allow faster updates
* consolidate segments to speed up
search and account for deletions
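GIN indexes in PostgreSQL use a similar idea. A sketch, assuming a hypothetical `docs` table with a `tsv` tsvector column: new entries first land in a pending list, which is later merged into the main index structure:

```sql
-- fast inserts accumulate in a pending list instead of the main GIN structure
CREATE INDEX docs_gin ON docs USING gin(tsv) WITH (fastupdate = on);

-- "consolidate": merge the pending list into the main index
-- (PostgreSQL 9.6+; VACUUM also does this as a side effect)
SELECT gin_clean_pending_list('docs_gin'::regclass);
```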
tokenizing
* parse/strip/convert format
* normalize terms (unaccent, ascii,
charsets, case folding, number precision..)
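A minimal sketch of those normalizations in PostgreSQL, using the built-in english configuration:

```sql
-- case folding and stemming happen inside to_tsvector
SELECT to_tsvector('english', 'Running RUNNERS run');

-- accent removal needs the unaccent extension
CREATE EXTENSION IF NOT EXISTS unaccent;
SELECT unaccent('josé');  -- 'jose'
```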
token analysis/filters
* find synonyms
* expand thesaurus
* stem (maybe in different languages)
more token analysis/filters
* eliminate stopwords
* store word distance/frequency
* store the full contents of some fields
* store some fields as attributes/facets
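Stopword elimination and position storage are both visible in a single call, again with the built-in english configuration:

```sql
-- stopwords ('the', 'over', 'a') are dropped; token positions are kept,
-- which is what later makes distance and density ranking possible
SELECT to_tsvector('english', 'the quick brown fox jumps over a lazy dog');
-- 'brown':3 'dog':9 'fox':4 'jump':5 'lazi':8 'quick':2
```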
“the index file” is really
* a token file, probably segmented/distributed
* some dictionary files: synonyms, thesaurus,
stopwords, stems/lexems (in different languages)
* word distance/frequency info
* attributes/original field files
* optional geospatial index
* auxiliary files: word/sentence boundaries, meta-info,
parser definitions, datasource definitions...
the hardest
part is now
over
searching in depth
* tokenize/analyse
* prepare operators
* retrieve information
* rank the results
* highlight the matched parts
searching in depth: tokenize
normalize, tokenize, and analyse
the original search term
the result would be a tokenized, stemmed,
“synonymised” term, without stopwords
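A small sketch: the search term goes through the same pipeline as the documents did at indexing time:

```sql
SELECT to_tsquery('english', 'databases & running');
-- 'databas' & 'run'  (stemmed; stopwords would be dropped)
```

For free-form user input, plainto_tsquery does the same without requiring operator syntax.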
searching in depth: operators
* partial search
* logical/geospatial/range operators
* in-sentence/in-paragraph/word distance
* faceting/grouping
searching in depth: retrieval
Go through the token index files, use the
attributes and geospatial files if necessary
for operators and/or grouping
You might need to do this in a distributed way
searching in depth: ranking
algorithm to sort the most relevant results:
* field weights
* word frequency/density
* geospatial or timestamp ranking
* ad-hoc ranking strategies
searching in depth: highlighting
Mark the matching parts of the results
It can be tricky/slow if you are not storing the full contents
in your indexes
PostgreSQL as a
full-text
search engine
search features
* index format configuration
* partial search
* word boundaries parser (not configurable)
* stemmers/synonyms/thesaurus/stopwords
* full-text logical operators
* attributes/geo/timestamp/range (using SQL)
* ranking strategies
* highlighting
* debugging/testing commands
indexing in postgresql
you don't actually need an index to use full-text search in PostgreSQL
but unless your db is very small, you want to have one
Choose GiST or GIN (GIN: faster search,
slower indexing, larger index size)
CREATE INDEX pgweb_idx ON pgweb USING
gin(to_tsvector(config_name, body));
Two new things
CREATE INDEX ... USING gin(to_tsvector (config_name, body));
* to_tsvector: postgresql way of saying “tokenize”
* config_name: tokenizing/analysis rule set
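An alternative sketch (table and column names assumed): materialize the tsvector in its own column so queries don't recompute it on every row. Generated columns need PostgreSQL 12+; older versions can maintain such a column with a trigger:

```sql
ALTER TABLE pgweb ADD COLUMN body_tsv tsvector
  GENERATED ALWAYS AS (to_tsvector('english', coalesce(body, ''))) STORED;

CREATE INDEX pgweb_tsv_idx ON pgweb USING gin(body_tsv);

-- queries can then use the column directly:
-- SELECT * FROM pgweb WHERE body_tsv @@ to_tsquery('english', 'postgres');
```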
Configuration
CREATE TEXT SEARCH CONFIGURATION
public.teowaki ( COPY = pg_catalog.english );
Configuration
CREATE TEXT SEARCH DICTIONARY english_ispell (
TEMPLATE = ispell,
DictFile = en_us,
AffFile = en_us,
StopWords = spanglish
);
CREATE TEXT SEARCH DICTIONARY spanish_ispell (
TEMPLATE = ispell,
DictFile = es_any,
AffFile = es_any,
StopWords = spanish
);
Configuration
CREATE TEXT SEARCH DICTIONARY english_stem (
TEMPLATE = snowball,
Language = english,
StopWords = english
);
CREATE TEXT SEARCH DICTIONARY spanish_stem (
TEMPLATE= snowball,
Language = spanish,
Stopwords = spanish
);
Configuration
Parser.
Word boundaries
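The parser itself cannot be configured, but it can be inspected. A sketch using the built-in default parser:

```sql
-- list the token types the default parser recognizes
SELECT * FROM ts_token_type('default');

-- run the parser alone, with no dictionaries applied
SELECT * FROM ts_parse('default', 'readme.txt http://example.com v1.2');
```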
Configuration
Assign dictionaries (in specific to generic order)
ALTER TEXT SEARCH CONFIGURATION teowaki
ALTER MAPPING FOR asciiword, asciihword, hword_asciipart, word, hword,
hword_part
WITH english_ispell, spanish_ispell, spanish_stem, unaccent, english_stem;
ALTER TEXT SEARCH CONFIGURATION teowaki
DROP MAPPING FOR email, url, url_path, sfloat, float;
debugging
select * from ts_debug('teowaki', 'I am searching unas
búsquedas con postgresql database');
also ts_lexize and ts_parse
tokenizing
tokens + position (stopwords are removed, tokens are folded)
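For example, with the built-in english configuration:

```sql
SELECT to_tsvector('english', 'a fat cat sat on a mat');
-- 'cat':3 'fat':2 'mat':7 'sat':4
-- stopwords ('a', 'on') removed, positions preserved
```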
searching
SELECT guid, description from wakis where
to_tsvector('teowaki',description)
@@ to_tsquery('teowaki','postgres');
searching
SELECT guid, description from wakis where
to_tsvector('teowaki',description)
@@ to_tsquery('teowaki','postgres:*');
operators
SELECT guid, description from wakis where
to_tsvector('teowaki',description)
@@ to_tsquery('teowaki','postgres | mysql');
ranking weights
SELECT setweight(to_tsvector(coalesce(name,'')),'A') ||
setweight(to_tsvector(coalesce(description,'')),'B')
from wakis limit 1;
search by weight
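A sketch of searching with those weights, reusing the deck's `wakis` table: ts_rank accepts an optional weights array in {D, C, B, A} order, so A-labeled matches (name) count more than B-labeled ones (description):

```sql
SELECT guid,
       ts_rank('{0.1, 0.2, 0.4, 1.0}',
               setweight(to_tsvector(coalesce(name,'')), 'A') ||
               setweight(to_tsvector(coalesce(description,'')), 'B'),
               to_tsquery('postgres')) AS rank
FROM wakis
ORDER BY rank DESC;
```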
ranking
SELECT name, ts_rank(to_tsvector(name), query) rank
from wakis, to_tsquery('postgres | indexes') query
where to_tsvector(name) @@ query order by rank DESC;
also ts_rank_cd
highlighting
SELECT ts_headline(name, query) from wakis,
to_tsquery('teowaki', 'game|play') query
where to_tsvector('teowaki', name) @@ query;
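ts_headline also takes an options string to control the markup and excerpt size; a sketch on the same query:

```sql
SELECT ts_headline('english', name, query,
         'StartSel=<b>, StopSel=</b>, MaxWords=15, MinWords=5')
FROM wakis, to_tsquery('teowaki', 'game | play') query
WHERE to_tsvector('teowaki', name) @@ query;
```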
USE POSTGRESQL
FOR EVERYTHING
When PostgreSQL is not good
* You need to index files (PDF, ODT...)
* Your index is very big (slow reindex)
* You need a distributed index
* You need complex tokenizers
* You need advanced rankers
When PostgreSQL is not good
* You want a REST API
* You want sentence/proximity/range/
more complex operators
* You want search auto completion
* You want advanced features (alerts...)
But it has been
perfect for us so far.
Our users don't care
which search engine
we use, as long as
it works.