full text search
DESCRIPTION
This contains basic information about full text search and how it can be implemented in PostgreSQL. It was presented at the India PostgreSQL meetup in Pune on 16 Nov, 2013.
TRANSCRIPT
![Page 1: Full text search](https://reader031.vdocuments.site/reader031/viewer/2022012322/54937824b47959991f8b4782/html5/thumbnails/1.jpg)
© 2013 NTT DATA, Inc.
Full Text Search
Rahila Syed, Beena Emerson
![Page 2: Full text search](https://reader031.vdocuments.site/reader031/viewer/2022012322/54937824b47959991f8b4782/html5/thumbnails/2.jpg)
Index
• Full text search and its types
• Full text search in PostgreSQL
• PostgreSQL extension
• Similarity Search
![Page 3: Full text search](https://reader031.vdocuments.site/reader031/viewer/2022012322/54937824b47959991f8b4782/html5/thumbnails/3.jpg)
Full Text Search
![Page 4: Full text search](https://reader031.vdocuments.site/reader031/viewer/2022012322/54937824b47959991f8b4782/html5/thumbnails/4.jpg)
What is full text search?
• Searching for a group of keywords in a pile of texts
– Document
– Query
– Similarity
• Full text search in a database
– Searching for a set of keywords in a text field of a database table
– The data used for full text search can be huge
– Words are indexed and the indexed words are associated with documents
![Page 5: Full text search](https://reader031.vdocuments.site/reader031/viewer/2022012322/54937824b47959991f8b4782/html5/thumbnails/5.jpg)
Full Text Search in PostgreSQL
![Page 6: Full text search](https://reader031.vdocuments.site/reader031/viewer/2022012322/54937824b47959991f8b4782/html5/thumbnails/6.jpg)
Steps
• Creating tokens
– Parsing the document into a set of tokens such as numbers, words, complex words and email addresses
• Creating lexemes
– Normalization, controlled by the dictionary:
   • Removal of suffixes: converts variants into a single form (worry, worries, worried, etc.)
   • Conversion to lower case
   • Removal of stop words: common words that are useless for searching (the, at, etc.)
• Storing preprocessed documents
– Storing documents and creating indexes over them for faster search
• Relevance ranking
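The tokenizing and normalization steps can be sketched in a few lines of plain Python. This is only an illustration under loud assumptions: the stop-word list is a tiny hypothetical one, and `toy_stem` is a crude suffix stripper, not the Snowball stemmer that PostgreSQL's english configuration uses. The names `to_lexemes` and `toy_stem` are made up for this sketch.

```python
import re

# Hypothetical, tiny stop-word list; the real english dictionary is far larger.
STOP_WORDS = {"a", "at", "be", "he", "is", "of", "the", "this", "to"}
VOWELS = set("aeiou")


def toy_stem(word):
    """Very rough suffix stripping; NOT the Snowball algorithm."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            word = word[: -len(suffix)]
            break
    # Turn a trailing consonant + 'y' into 'i' so "worry" meets "worries".
    if len(word) > 2 and word.endswith("y") and word[-2] not in VOWELS:
        word = word[:-1] + "i"
    return word


def to_lexemes(document):
    """Return {lexeme: [positions]} using 1-based token positions."""
    lexemes = {}
    tokens = re.findall(r"[a-z0-9]+", document.lower())  # tokens, lower-cased
    for position, token in enumerate(tokens, start=1):
        if token in STOP_WORDS:
            continue  # stop words are dropped but still consume a position
        lexemes.setdefault(toy_stem(token), []).append(position)
    return lexemes


print(to_lexemes("Glad to be part of this meetup"))
print(to_lexemes("worry worries worried"))
```

Positions are kept next to each lexeme because ranking functions can make use of them; the stop words leave gaps in the numbering rather than shifting it.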
![Page 7: Full text search](https://reader031.vdocuments.site/reader031/viewer/2022012322/54937824b47959991f8b4782/html5/thumbnails/7.jpg)
Full text search in PostgreSQL
• Full integration
• 27 built-in configurations for 10 languages
• Support of user-defined FTS configurations
• Pluggable dictionaries (ispell, snowball, thesaurus) and parsers
• Relevance ranking
• GIN and GiST indexes
![Page 8: Full text search](https://reader031.vdocuments.site/reader031/viewer/2022012322/54937824b47959991f8b4782/html5/thumbnails/8.jpg)
Full text search in PostgreSQL
Morphological search
• Indexed tokens are words of a language
• E.g. tree, book, rain
• Small index size
• Handles orthographical variants well
• Search results depend on how the text is divided into words
• Used for large documents such as theses
• Ex. tsvector
N-gram search
• Indexed tokens are character sequences
• E.g. _t, tr, re, e_ (2-grams)
• Big index size
• Cannot match orthographical variants
• Results are closer to an indexed LIKE
• Better suited for a limited set of words
• Ex. pg_bigm, pg_trgm
![Page 9: Full text search](https://reader031.vdocuments.site/reader031/viewer/2022012322/54937824b47959991f8b4782/html5/thumbnails/9.jpg)
Why full text search?
• Search similar words (no linguistic support)
• Ranking of search results
• Searches substrings
– Indexes do not support substring search
– The LIKE operator does not use an index when the pattern starts with %
– Low performance compared with full text search using GIN and GiST
• Accuracy issue
– E.g. LIKE '%one%' matches prone, money, lonely
![Page 10: Full text search](https://reader031.vdocuments.site/reader031/viewer/2022012322/54937824b47959991f8b4782/html5/thumbnails/10.jpg)
Measurement results
• POSIX expression
=# EXPLAIN ANALYZE SELECT * FROM fulltext_search WHERE doc ~ 'postgresql';
QUERY PLAN
--------------------------------------------------------------------------
Seq Scan on fulltext_search (cost=10000000000.00..10000000473.77 rows=40 width=152) (actual time=10.871..390.019 rows=250 loops=1)
  Filter: (doc ~ 'postgresql'::text)
  Rows Removed by Filter: 11397
Total runtime: 390.060 ms
• LIKE query
=# EXPLAIN ANALYZE SELECT * FROM fulltext_search WHERE doc LIKE '%postgresql%';
QUERY PLAN
--------------------------------------------------------------------------
Seq Scan on fulltext_search (cost=10000000000.00..10000000473.77 rows=40 width=152) (actual time=1.342..110.107 rows=250 loops=1)
  Filter: (doc ~~ '%postgresql%'::text)
  Rows Removed by Filter: 11397
Total runtime: 110.134 ms
![Page 11: Full text search](https://reader031.vdocuments.site/reader031/viewer/2022012322/54937824b47959991f8b4782/html5/thumbnails/11.jpg)
Measurement results
• Full Text Search
Nested Loop (cost=352.83..508.22 rows=107 width=64) (actual time=1.397..1.575 rows=250 loops=1)
  -> Function Scan on to_tsquery query (cost=0.00..0.01 rows=1 width=32) (actual time=0.023..0.023 rows=1 loops=1)
  -> Bitmap Heap Scan on full_text_search (cost=352.83..507.14 rows=107 width=32) (actual time=1.371..1.516 rows=250 loops=1)
       Recheck Cond: (query.query @@ to_tsvector('english'::regconfig, doc))
       -> Bitmap Index Scan on full_search_idx (cost=0.00..352.80 rows=107 width=0) (actual time=1.354..1.354 rows=348 loops=1)
            Index Cond: (query.query @@ to_tsvector('english'::regconfig, doc))
Total runtime: 1.619 ms
![Page 12: Full text search](https://reader031.vdocuments.site/reader031/viewer/2022012322/54937824b47959991f8b4782/html5/thumbnails/12.jpg)
Ranking Example
Normal search:
SELECT * FROM tbl WHERE col1 LIKE 'The tiger is the largest cat species';
col1
--------------------------------------
The tiger is the largest cat species
(1 row)
Full text search:
SELECT col1, similarity(col1, 'The tiger is the largest cat species') AS sml
FROM tbl_t WHERE col1 % 'The tiger is the largest cat species'
ORDER BY sml DESC, col1;
col1 | sml
-----------------------------------------+----------
The tiger is the largest cat species | 1
The peacock is the largest bird species | 0.511111
The cheetah is the fastest cat species | 0.466667
(3 rows)
![Page 13: Full text search](https://reader031.vdocuments.site/reader031/viewer/2022012322/54937824b47959991f8b4782/html5/thumbnails/13.jpg)
Indexes Used in Full Text Search
• GIN (Generalized Inverted Index)
– Custom strategies for particular data types
– Inverted indexes
– Interface for custom data types
– Slower to update
– Deterministic
– Appropriate for fixed data sets
KEY | TID
-------+----------
Meetup | 100, 140
Pune | 100, 150
Here | 100
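The KEY/TID table is an inverted index: each word points to the rows that contain it. A minimal sketch of the idea in Python, assuming made-up row contents chosen so that the resulting index matches the table; `build_inverted_index` and `lookup` are hypothetical names, and real GIN storage is considerably more involved.

```python
from collections import defaultdict

# Hypothetical rows keyed by TID; contents invented to reproduce the table.
rows = {
    100: "Here Pune Meetup",
    140: "Meetup",
    150: "Pune",
}


def build_inverted_index(rows):
    """Map each lower-cased word to the sorted list of TIDs containing it."""
    index = defaultdict(list)
    for tid in sorted(rows):
        for word in set(rows[tid].lower().split()):
            index[word].append(tid)
    return dict(index)


def lookup(index, words):
    """Return TIDs that contain every word (intersection of posting lists)."""
    postings = [set(index.get(word.lower(), [])) for word in words]
    return sorted(set.intersection(*postings)) if postings else []


index = build_inverted_index(rows)
print(index)
print(lookup(index, ["Meetup", "Pune"]))
```

Because every key lists exactly the rows that contain it, a lookup never returns a row that lacks the word, which is why the slide calls this index deterministic.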
![Page 14: Full text search](https://reader031.vdocuments.site/reader031/viewer/2022012322/54937824b47959991f8b4782/html5/thumbnails/14.jpg)
Indexes Used in Full Text Search
• GiST (Generalized Search Tree)
– Interface for data types and access methods
– Document is represented in the index by a fixed-length signature
– Based on hash tables
– Probability of false match
– Table row must be retrieved to see if the match is correct
– Inappropriate for large data sets
– Filtering data at the end of the index search to remove false matches
EXPLAIN SELECT * FROM tab WHERE text_search @@ to_tsquery('Mountain');
QUERY PLAN
--------------------------------------------------------------------------
Index Scan using text_search_idx on tab (cost=0.00..12.29 rows=2 width=1469)
  Index Cond: (text_search @@ '''Mountain'''::tsquery)
  Filter: (text_search @@ '''Mountain'''::tsquery)
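Why a fixed-length signature can produce false matches, and why the row has to be fetched and filtered, can be shown with a toy model. Everything here is an assumption made for the demo: the 16-bit width, the `word_bit` hash (a sum of character codes, chosen so that anagrams collide on purpose) and the two sample documents. It is not how PostgreSQL hashes lexemes.

```python
SIGNATURE_BITS = 16  # assumed width, picked only for this demo


def word_bit(word):
    """Toy hash: one bit per word; anagrams deliberately share a bit."""
    return 1 << (sum(ord(ch) for ch in word) % SIGNATURE_BITS)


def signature(words):
    """OR the bits of all words into one fixed-length signature."""
    sig = 0
    for word in words:
        sig |= word_bit(word)
    return sig


# Hypothetical documents, already split into words.
docs = {1: ["green", "tea"], 2: ["eat", "out"]}
signatures = {tid: signature(words) for tid, words in docs.items()}


def search(word):
    """Return (index candidates, rows that survive the filter step)."""
    bit = word_bit(word)
    candidates = [tid for tid, sig in sorted(signatures.items()) if sig & bit]
    matches = [tid for tid in candidates if word in docs[tid]]
    return candidates, matches


print(search("tea"))
```

Document 2 contains "eat", which sets the same bit as "tea", so the signature test lets it through and only the check against the real row removes it.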
![Page 15: Full text search](https://reader031.vdocuments.site/reader031/viewer/2022012322/54937824b47959991f8b4782/html5/thumbnails/15.jpg)
tsvector
• Representation of a document best suited for full text search
• Normalized lexemes formed by pre-processing the document
• Function to convert normal text to tsvector:
– to_tsvector
to_tsvector([ config regconfig, ] document text) returns tsvector
=# SELECT to_tsvector('english', 'Glad to be part of this meetup');
to_tsvector
------------------------------
'glad':1 'meetup':7 'part':4
(1 row)
• The query above specifies 'english' as the configuration used to parse and normalize the string. If the configuration argument is omitted, the value of default_text_search_config is used.
![Page 16: Full text search](https://reader031.vdocuments.site/reader031/viewer/2022012322/54937824b47959991f8b4782/html5/thumbnails/16.jpg)
tsquery
• Representation of a search query best suited for full text search
• Normalized lexemes formed by processing the query
• Lexemes may be combined using the AND, OR and NOT operators
• All keywords used for search
![Page 17: Full text search](https://reader031.vdocuments.site/reader031/viewer/2022012322/54937824b47959991f8b4782/html5/thumbnails/17.jpg)
tsquery
• Functions to convert normal text to tsquery:
– to_tsquery
to_tsquery([ config regconfig, ] querytext text) returns tsquery
=# SELECT to_tsquery('meetups & in & ! Pune');
to_tsquery
--------------------
'meetup' & !'pune'
(1 row)
– plainto_tsquery
plainto_tsquery([ config regconfig, ] querytext text) returns tsquery
=# SELECT plainto_tsquery('english', 'meetups in Pune');
plainto_tsquery
-------------------
'meetup' & 'pune'
(1 row)
![Page 18: Full text search](https://reader031.vdocuments.site/reader031/viewer/2022012322/54937824b47959991f8b4782/html5/thumbnails/18.jpg)
Match operator @@
• Checks a tsvector (document) against a tsquery (search words)
• Returns true if the tsvector satisfies the tsquery; for a query built by plainto_tsquery, that means every lexeme of the query is present in the document
=# SELECT to_tsvector('Welcome to this postgresql meetup') @@ plainto_tsquery('PostgreSQL Meetups');
?column?
----------
t
(1 row)
=# SELECT to_tsvector('Welcome to this postgresql meetup') @@ plainto_tsquery('Pune meetup');
?column?
----------
f
(1 row)
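What "satisfies the query" means once AND, OR and NOT are involved can be sketched as a small recursive evaluator. This is an illustration only: the nested-tuple query format and the `matches` function are hypothetical, and the lexeme set is written out by hand rather than produced by PostgreSQL.

```python
def matches(lexemes, query):
    """Evaluate a query tree against a set of document lexemes.

    A query is either a lexeme string or a tuple such as
    ("and", q1, q2), ("or", q1, q2) or ("not", q).
    """
    if isinstance(query, str):
        return query in lexemes
    operator, *operands = query
    if operator == "and":
        return all(matches(lexemes, operand) for operand in operands)
    if operator == "or":
        return any(matches(lexemes, operand) for operand in operands)
    if operator == "not":
        return not matches(lexemes, operands[0])
    raise ValueError(f"unknown operator: {operator}")


# Lexemes assumed for 'Welcome to this postgresql meetup' after normalization.
document = {"welcom", "postgresql", "meetup"}

print(matches(document, ("and", "postgresql", "meetup")))
print(matches(document, ("and", "pune", "meetup")))
print(matches(document, ("and", "meetup", ("not", "pune"))))
```

The first two calls mirror the two psql examples; the third shows that a NOT term matches precisely because the lexeme is absent.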
![Page 19: Full text search](https://reader031.vdocuments.site/reader031/viewer/2022012322/54937824b47959991f8b4782/html5/thumbnails/19.jpg)
Full text search without index
SELECT * FROM <table> WHERE to_tsvector('<config>', <colname>) @@ to_tsquery('<config>', '<search word>');
The configuration argument passed to to_tsvector and to_tsquery should be the same.
Example:
=# SELECT * FROM tbl WHERE to_tsvector('english', col) @@ to_tsquery('english', 'enjoy');
col
--------------------------------
He enjoyed the party
He enjoys the classical music.
(2 rows)
![Page 20: Full text search](https://reader031.vdocuments.site/reader031/viewer/2022012322/54937824b47959991f8b4782/html5/thumbnails/20.jpg)
Full text search using index
• Creating the index:
CREATE INDEX <index_name> ON <table> USING gin(to_tsvector('<config>', <col>));
• Performing a search using the index:
SELECT * FROM <table> WHERE to_tsvector('<config>', <col>) @@ plainto_tsquery('<config>', '<search word>');
Example:
=# CREATE INDEX idx ON tbl USING gin(to_tsvector('english', col));
=# SELECT * FROM tbl WHERE to_tsvector('english', col) @@ plainto_tsquery('english', 'enjoy');
col
--------------------------------
He enjoyed the party
He enjoys the classical music.
(2 rows)
![Page 21: Full text search](https://reader031.vdocuments.site/reader031/viewer/2022012322/54937824b47959991f8b4782/html5/thumbnails/21.jpg)
Full text search using separate column
• Procedure
– Create a column of tsvector type
– Define a trigger that automatically updates the tsvector column
– Perform the search on the tsvector column
• Advantages
– No need to specify the text search configuration in every query in order to make use of the index
– Faster searches, as the to_tsvector function is not called for each search query
![Page 22: Full text search](https://reader031.vdocuments.site/reader031/viewer/2022012322/54937824b47959991f8b4782/html5/thumbnails/22.jpg)
Full text search using separate column
Example:
=# CREATE TABLE tbl (col text, tsv_col tsvector);
=# CREATE TRIGGER tsvectorupdate BEFORE INSERT OR UPDATE ON tbl
FOR EACH ROW EXECUTE PROCEDURE tsvector_update_trigger(tsv_col, 'pg_catalog.english', col);
=# INSERT INTO tbl VALUES ('He enjoyed the party'), ('He enjoys the classical music.'), ('The moon winked at him');
=# SELECT * FROM tbl;
col | tsv_col
--------------------------------+---------------------------------
He enjoyed the party | 'enjoy':2 'parti':4
He enjoys the classical music. | 'classic':4 'enjoy':2 'music':5
The moon winked at him | 'moon':2 'wink':3
(3 rows)
![Page 23: Full text search](https://reader031.vdocuments.site/reader031/viewer/2022012322/54937824b47959991f8b4782/html5/thumbnails/23.jpg)
Full text search using separate column
Example:
=# CREATE TABLE tbl (col text, tsv_col tsvector);
=# CREATE TRIGGER tsvectorupdate BEFORE INSERT OR UPDATE ON tbl
FOR EACH ROW EXECUTE PROCEDURE tsvector_update_trigger(tsv_col, 'pg_catalog.english', col);
=# INSERT INTO tbl VALUES ('He enjoyed the party'), ('He enjoys the classical music.'), ('The moon winked at him');
=# SELECT col FROM tbl WHERE tsv_col @@ to_tsquery('enjoys');
col
--------------------------------
He enjoyed the party
He enjoys the classical music.
(2 rows)
![Page 24: Full text search](https://reader031.vdocuments.site/reader031/viewer/2022012322/54937824b47959991f8b4782/html5/thumbnails/24.jpg)
Ranking
• ts_rank
– Lexical ranking
ts_rank([ weights float4[], ] vector tsvector, query tsquery [, normalization integer ]) returns float4
=# select ts_rank(to_tsvector('Free text seaRCh is a wonderful Thing'), to_tsquery('wonderful | thing'));
ts_rank
-----------
0.0607927
• ts_rank_cd
– Proximity ranking
=# select ts_rank_cd(to_tsvector('Free text seaRCh is a wonderful Thing'), to_tsquery('wonderful & thing'));
ts_rank_cd
------------
0.1
![Page 25: Full text search](https://reader031.vdocuments.site/reader031/viewer/2022012322/54937824b47959991f8b4782/html5/thumbnails/25.jpg)
Ranking
• Structural ranking
– Query
select ts_rank(array[0.1,0.1,0.9,0.1],
setweight(to_tsvector('All about search'), 'B') ||
setweight(to_tsvector('Free text seaRCh is a wonderful Thing'), 'A'),
to_tsquery('wonderful & search'));
– Result
ts_rank
----------
0.328337
![Page 26: Full text search](https://reader031.vdocuments.site/reader031/viewer/2022012322/54937824b47959991f8b4782/html5/thumbnails/26.jpg)
PostgreSQL Extension
![Page 27: Full text search](https://reader031.vdocuments.site/reader031/viewer/2022012322/54937824b47959991f8b4782/html5/thumbnails/27.jpg)
pg_trgm
• Uses an index made from trigrams: groups of 3 consecutive characters taken from a string
• Finds string similarity by comparing the trigrams
• Provides GiST and GIN index operator classes for creating an index:
CREATE INDEX <idx> ON <tbl> USING gist (<col> gist_trgm_ops);
CREATE INDEX <idx> ON <tbl> USING gin (<col> gin_trgm_ops);
• Problems:
– No partial match algorithm
– Slow when the search key is shorter than 3 characters (GIN_SEARCH_MODE_ALL is used)
![Page 28: Full text search](https://reader031.vdocuments.site/reader031/viewer/2022012322/54937824b47959991f8b4782/html5/thumbnails/28.jpg)
pg_bigm
• PostgreSQL module that provides full text search capability using a 2-gram index
• Based on pg_trgm
• First released in April 2013; version 1.1 to be released soon
• Developed by NTT DATA
• Site: http://sourceforge.jp/projects/pgbigm/
![Page 29: Full text search](https://reader031.vdocuments.site/reader031/viewer/2022012322/54937824b47959991f8b4782/html5/thumbnails/29.jpg)
Difference
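Feature | pg_trgm | pg_bigm
-----------------------------+-------------------------------+--------------------------------
Method of full text search | 3-gram: " a", " ab", abc, bcd | 2-gram: " a", ab, bc, cd, "d "
Available index | GIN and GiST | GIN only
1-2 character keyword search | Slow | Fast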
![Page 30: Full text search](https://reader031.vdocuments.site/reader031/viewer/2022012322/54937824b47959991f8b4782/html5/thumbnails/30.jpg)
Install pg_bigm
• Download the tar.gz file from the site
• Install pg_bigm:
$ make USE_PGXS=1
$ su
# make USE_PGXS=1 install
• Register: set the following postgresql.conf variables:
– shared_preload_libraries = 'pg_bigm'
– custom_variable_classes = 'pg_bigm' (only in 9.1)
• Load into the required database:
=# CREATE EXTENSION pg_bigm;
![Page 31: Full text search](https://reader031.vdocuments.site/reader031/viewer/2022012322/54937824b47959991f8b4782/html5/thumbnails/31.jpg)
Function – show_bigm
Argument: search string
Return value: array of all possible 2-gram character strings
Procedure:
• For each word, perform the following:
– Add a space character before and after the text
– Moving from left to right, extract strings in units of 2 characters
=# SELECT show_bigm('ab');
show_bigm
----------------
{" a",ab,"b "}
(1 row)
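The procedure can be mimicked in plain Python to see where each 2-gram comes from. This is a sketch of the idea, not pg_bigm's code: it pads the whole string once, and returning a sorted, de-duplicated list is an assumption made to match the psql output above.

```python
def show_bigm(text):
    """Return the sorted, de-duplicated 2-grams of a space-padded string."""
    padded = f" {text} "  # one space before and one after the text
    bigrams = {padded[i:i + 2] for i in range(len(padded) - 1)}
    return sorted(bigrams)


print(show_bigm("ab"))
print(show_bigm("cat"))
```

The padding is what produces the `" a"` and `"b "` entries, so the first and last characters of the text get 2-grams of their own.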
![Page 32: Full text search](https://reader031.vdocuments.site/reader031/viewer/2022012322/54937824b47959991f8b4782/html5/thumbnails/32.jpg)
Function - likequery
Argument: search string
Return value: string in a pattern to be used in LIKE for full text search
Procedure:
• Add % to the beginning and the end of the search string.
• Add a backslash (\) before every underscore (_), percent (%) and backslash (\) present in the search string.
=# SELECT likequery('pg_bigm ppt');
likequery
----------------
%pg\_bigm ppt%
(1 row)
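The two steps of the procedure translate directly into a few lines of Python. This is a re-implementation for illustration only, not the extension's source.

```python
def likequery(text):
    """Escape LIKE metacharacters and wrap the text in % wildcards."""
    escaped = ""
    for ch in text:
        if ch in ("\\", "%", "_"):
            escaped += "\\"  # a backslash goes before \, % and _
        escaped += ch
    return f"%{escaped}%"


print(likequery("pg_bigm ppt"))
print(likequery("100%"))
```

Without the escaping, the underscore in `pg_bigm` would act as a single-character wildcard in the LIKE pattern and could match unrelated text.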
![Page 33: Full text search](https://reader031.vdocuments.site/reader031/viewer/2022012322/54937824b47959991f8b4782/html5/thumbnails/33.jpg)
Creation of Index
• Only GIN is supported
• Create an index on the text column of a table:
CREATE INDEX <index_name> ON <table> USING gin (<column> gin_bigm_ops);
Table
TID | Data
----+------
1 | cat
5 | mat
Generate bigrams
cat: " c", at, ca, "t "
mat: " m", at, ma, "t "
Index
Key | TID
-----+------
" c" | 1
" m" | 5
at | 1, 5
ca | 1
ma | 5
"t " | 1, 5
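How the Key/TID table follows from the two rows can be reproduced with a short Python sketch. The helper names `bigrams` and `build_bigram_index` are hypothetical, and a dictionary of lists only stands in for the GIN structure.

```python
from collections import defaultdict


def bigrams(text):
    """2-grams of the text after padding it with one space on each side."""
    padded = f" {text} "
    return {padded[i:i + 2] for i in range(len(padded) - 1)}


def build_bigram_index(rows):
    """Map every 2-gram to the sorted list of TIDs whose data contains it."""
    index = defaultdict(list)
    for tid in sorted(rows):
        for gram in bigrams(rows[tid]):
            index[gram].append(tid)
    return dict(index)


rows = {1: "cat", 5: "mat"}
index = build_bigram_index(rows)
for key in sorted(index):
    print(repr(key), index[key])
```

The keys `at` and `"t "` point to both rows, which is why a search key that shares only those 2-grams cannot tell the two rows apart at the index level.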
![Page 34: Full text search](https://reader031.vdocuments.site/reader031/viewer/2022012322/54937824b47959991f8b4782/html5/thumbnails/34.jpg)
Full text search Query
SELECT * FROM <tbl> WHERE <col> LIKE likequery('<word>');
=# EXPLAIN ANALYZE SELECT * FROM tbl WHERE col LIKE likequery('cat');
QUERY PLAN
-------------------------------------------------------------------
Bitmap Heap Scan on tbl (cost=12.00..16.01 rows=1 width=4) (actual time=0.038..0.039 rows=1 loops=1)
  Recheck Cond: (col ~~ '%cat%'::text)
  -> Bitmap Index Scan on idx (cost=0.00..12.00 rows=1 width=0) (actual time=0.025..0.025 rows=1 loops=1)
       Index Cond: (col ~~ '%cat%'::text)
Total runtime: 0.093 ms
(5 rows)
![Page 35: Full text search](https://reader031.vdocuments.site/reader031/viewer/2022012322/54937824b47959991f8b4782/html5/thumbnails/35.jpg)
Full text search Query
Search key
Generate bigrams
Index lookup
Key | TID
-----+------
" c" | 1
" m" | 5
at | 1, 5
ca | 1
ma | 5
"t " | 1, 5
Result candidates
TID | Data
----+------
1 | cat
Perform recheck
Final result
TID | Data
----+------
1 | cat
![Page 36: Full text search](https://reader031.vdocuments.site/reader031/viewer/2022012322/54937824b47959991f8b4782/html5/thumbnails/36.jpg)
Why Recheck?
• Removes wrong results from the result candidates of the index scan.
=# EXPLAIN ANALYZE SELECT * FROM tbl WHERE col LIKE likequery('trial');
QUERY PLAN
-------------------------------------------------------------------
Bitmap Heap Scan on tbl (cost=24.00..28.01 rows=1 width=5) (actual time=0.060..0.060 rows=1 loops=1)
  Recheck Cond: (col ~~ '%trial%'::text)
  Rows Removed by Index Recheck: 1
  -> Bitmap Index Scan on idx (cost=0.00..24.00 rows=1 width=0) (actual time=0.043..0.043 rows=2 loops=1)
       Index Cond: (col ~~ '%trial%'::text)
Total runtime: 0.117 ms
(6 rows)
![Page 37: Full text search](https://reader031.vdocuments.site/reader031/viewer/2022012322/54937824b47959991f8b4782/html5/thumbnails/37.jpg)
Why Recheck?
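Table
TID | Data
----+---------
1 | trial
2 | trivial
Bigrams
trial: " t", al, ia, "l ", ri, tr
trivial: " t", al, ia, iv, "l ", ri, tr, vi
Index
Key | TID
-----+------
" t" | 1, 2
al | 1, 2
ia | 1, 2
iv | 2
"l " | 1, 2
ri | 1, 2
tr | 1, 2
vi | 2
Search 'trial'
Index scan
TID | Data
----+---------
1 | trial
2 | trivial
Recheck
TID | Data
----+---------
1 | trial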
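The trial/trivial case can be replayed in Python to show why the index scan alone returns a wrong candidate. This is a simplified model under stated assumptions: the index test is reduced to "every 2-gram of the key occurs in the row", the key is space-padded as on the slide, and the `enable_recheck` argument is a hypothetical flag on this sketch's own `search` function.

```python
def bigrams(text):
    """2-grams of the text after padding it with one space on each side."""
    padded = f" {text} "
    return {padded[i:i + 2] for i in range(len(padded) - 1)}


rows = {1: "trial", 2: "trivial"}


def index_scan(key):
    """Candidates: rows whose 2-grams include every 2-gram of the key."""
    wanted = bigrams(key)
    return [tid for tid in sorted(rows) if wanted <= bigrams(rows[tid])]


def search(key, enable_recheck=True):
    """Index scan, optionally followed by a recheck against the real data."""
    candidates = index_scan(key)
    if not enable_recheck:
        return candidates
    return [tid for tid in candidates if key in rows[tid]]


print(index_scan("trial"))
print(search("trial"))
```

Every 2-gram of "trial" also occurs somewhere in "trivial", just not consecutively, so only the comparison against the stored string can reject row 2.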
![Page 38: Full text search](https://reader031.vdocuments.site/reader031/viewer/2022012322/54937824b47959991f8b4782/html5/thumbnails/38.jpg)
Disabling Recheck
Parameter - enable_recheck
• Set to off to disable recheck and get all the results retrieved by the index scan
• Values: on/off
=# SET pg_bigm.enable_recheck = on;
=# SELECT * FROM tbl WHERE doc LIKE likequery('trial');
doc
----------------------
He is awaiting trial
(1 row)
=# SET pg_bigm.enable_recheck = off;
=# SELECT * FROM tbl WHERE doc LIKE likequery('trial');
doc
--------------------------
He is awaiting trial
It was a trivial mistake
(2 rows)
![Page 39: Full text search](https://reader031.vdocuments.site/reader031/viewer/2022012322/54937824b47959991f8b4782/html5/thumbnails/39.jpg)
pg_bigm Full Text Search Sample
=# CREATE TABLE tbl (col text);
=# CREATE INDEX tbl_idx ON tbl USING gin (col gin_bigm_ops);
=# INSERT INTO tbl VALUES
('He is awaiting trial'),
('Those orchids are very special to her'),
('pg_bigm performs full text search using 2 gram index'),
('pg_trgm performs full text search using 3 gram index');
=# SELECT * FROM tbl WHERE col LIKE likequery('full text search');
col
------------------------------------------------------
pg_bigm performs full text search using 2 gram index
pg_trgm performs full text search using 3 gram index
(2 rows)
![Page 40: Full text search](https://reader031.vdocuments.site/reader031/viewer/2022012322/54937824b47959991f8b4782/html5/thumbnails/40.jpg)
Similarity Search
![Page 41: Full text search](https://reader031.vdocuments.site/reader031/viewer/2022012322/54937824b47959991f8b4782/html5/thumbnails/41.jpg)
Function – bigm_similarity
Arguments: the two strings whose similarity is to be checked
Return value: the similarity of the two arguments (0 to 1)
• Measures the similarity of two strings by counting the number of 2-grams they share.
=# SELECT bigm_similarity('test', 'text');
bigm_similarity
-----------------
0.6
(1 row)
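The 0.6 above can be reproduced by hand with a small Python sketch. The normalization is an assumption: dividing the number of shared 2-grams by the larger of the two 2-gram counts matches the values shown in this deck (0.6 for test/text, 0.333333 for treat/test), but it is inferred from those numbers rather than taken from the pg_bigm source.

```python
def bigrams(text):
    """2-grams of the text after padding it with one space on each side."""
    padded = f" {text} "
    return {padded[i:i + 2] for i in range(len(padded) - 1)}


def bigm_similarity(a, b):
    """Shared 2-grams divided by the larger 2-gram count (assumed formula)."""
    grams_a, grams_b = bigrams(a), bigrams(b)
    shared = len(grams_a & grams_b)
    return shared / max(len(grams_a), len(grams_b))


print(sorted(bigrams("test") & bigrams("text")))
print(bigm_similarity("test", "text"))
print(bigm_similarity("treat", "test"))
```

"test" and "text" each have five 2-grams and share three of them (`" t"`, `te`, `"t "`), which gives 3 / 5.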
![Page 42: Full text search](https://reader031.vdocuments.site/reader031/viewer/2022012322/54937824b47959991f8b4782/html5/thumbnails/42.jpg)
Parameter - similarity_limit
• Specifies the threshold used for the similarity search
• The search returns rows with similarity value >= similarity_limit
• Default: 0.3
• The SET command can be used to modify the value.
=# SHOW pg_bigm.similarity_limit;
pg_bigm.similarity_limit
--------------------------
0.3
(1 row)
=# SET pg_bigm.similarity_limit = 0.5;
![Page 43: Full text search](https://reader031.vdocuments.site/reader031/viewer/2022012322/54937824b47959991f8b4782/html5/thumbnails/43.jpg)
Similarity Operator - =%
• Used to perform similarity search
• Uses the full text search index
• Returns rows whose similarity is higher than or equal to the value of pg_bigm.similarity_limit
SELECT * FROM <tbl> WHERE <col> =% '<key>';
![Page 44: Full text search](https://reader031.vdocuments.site/reader031/viewer/2022012322/54937824b47959991f8b4782/html5/thumbnails/44.jpg)
Similarity Search Sample
=# SET pg_bigm.similarity_limit = 0.2;
=# SELECT *, bigm_similarity(col, 'test') FROM tbl WHERE col =% 'test';
col | bigm_similarity
-------+-----------------
test | 1
text | 0.6
treat | 0.333333
(3 rows)
=# SET pg_bigm.similarity_limit = 0.5;
=# SELECT *, bigm_similarity(col, 'test') FROM tbl WHERE col =% 'test';
col | bigm_similarity
------+-----------------
test | 1
text | 0.6
(2 rows)
![Page 45: Full text search](https://reader031.vdocuments.site/reader031/viewer/2022012322/54937824b47959991f8b4782/html5/thumbnails/45.jpg)
References
• PostgreSQL documents
• wiki.postgresql.org
• Understanding Full Text Search
– http://linuxgazette.net/164/sephton.html
– http://www.slideshare.net/billkarwin/full-text-search-in-postgresql
• Understanding pg_bigm
– pgbigm.sourceforge.jp
– www.slideshare.net/masahikosawada98/pg-bigm
![Page 46: Full text search](https://reader031.vdocuments.site/reader031/viewer/2022012322/54937824b47959991f8b4782/html5/thumbnails/46.jpg)