PostgreSQL: Advanced features in practice
Transactional DDL, partial and function indexes, fuzzy string matching with trigram indexes, views, recursive (WITH) queries, and window functions.
Ján Suchal
22.11.2011
@Rubyslava
Why PostgreSQL?
The world’s most advanced open source database.
Features!
Transactional DDL
Cost-based query optimizer + Graphical explain
Partial indexes
Function indexes
K-nearest search
Views
Recursive Queries
Window Functions
Transactional DDL
class CreatePostsMigration < ActiveRecord::Migration
  def change
    create_table :posts do |t|
      t.string :name, null: false
      t.text :body, null: false
      t.references :author, null: false
      t.timestamps null: false
    end
    add_index :posts, :title, unique: true
  end
end
Where is the problem?
Column title does not exist! Without transactional DDL the table is created but the index is not, leaving a half-applied migration. Oops!
Transactional DDL FTW: the failed migration rolls back as a whole.
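The rollback behaviour can be tried without a PostgreSQL server. The sketch below uses Python's built-in sqlite3 module purely as a stand-in, because SQLite also treats DDL as transactional; it illustrates the idea and is not PostgreSQL or ActiveRecord itself. The function names and the table layout are assumptions made for the example.

```python
import sqlite3


def run_migration(conn):
    """Create posts plus a unique index inside one transaction.

    Returns True if the migration committed, False if it rolled back.
    """
    conn.execute("BEGIN")
    try:
        conn.execute("CREATE TABLE posts (name TEXT NOT NULL, body TEXT NOT NULL)")
        # Same mistake as on the slide: the column is called name, not title.
        conn.execute("CREATE UNIQUE INDEX idx_posts_title ON posts (title)")
    except sqlite3.OperationalError:
        # The index failed, so undo the CREATE TABLE as well.
        conn.execute("ROLLBACK")
        return False
    conn.execute("COMMIT")
    return True


def table_exists(conn, name):
    """Check the SQLite catalog for a table with the given name."""
    row = conn.execute(
        "SELECT count(*) FROM sqlite_master WHERE type = 'table' AND name = ?",
        (name,),
    ).fetchone()
    return row[0] == 1


# isolation_level=None keeps sqlite3 in autocommit mode, so the
# BEGIN / ROLLBACK / COMMIT statements above are the only transaction control.
conn = sqlite3.connect(":memory:", isolation_level=None)
migrated = run_migration(conn)
posts_left_behind = table_exists(conn, "posts")
```

After the failed run no posts table is left behind, so the corrected migration can simply be run again.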
Cost-based query optimizer
What is the best plan to execute a given query?
Cost = I/O + CPU operations needed
Sequential vs. random seek
Join order
Join type (nested loop, hash join, merge join)
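To make "Cost = I/O + CPU" concrete, here is a deliberately simplified sketch of how a planner might compare a sequential scan with an index scan. The constants are PostgreSQL's default planner cost settings; the index-scan formula is a toy assumption (one random page fetch per matching row), much cruder than what the real planner computes.

```python
# PostgreSQL's default planner cost constants.
SEQ_PAGE_COST = 1.0
RANDOM_PAGE_COST = 4.0
CPU_TUPLE_COST = 0.01
CPU_INDEX_TUPLE_COST = 0.005
CPU_OPERATOR_COST = 0.0025


def seq_scan_cost(pages, rows):
    """Read every page sequentially and evaluate one filter on every row."""
    io = pages * SEQ_PAGE_COST
    cpu = rows * (CPU_TUPLE_COST + CPU_OPERATOR_COST)
    return io + cpu


def index_scan_cost(matching_rows):
    """Toy model (assumption): one random page fetch per matching row."""
    io = matching_rows * RANDOM_PAGE_COST
    cpu = matching_rows * (CPU_INDEX_TUPLE_COST + CPU_TUPLE_COST)
    return io + cpu


def cheapest_plan(pages, rows, matching_rows):
    """Return the name of the plan with the lower estimated cost."""
    costs = {
        "seq scan": seq_scan_cost(pages, rows),
        "index scan": index_scan_cost(matching_rows),
    }
    return min(costs, key=costs.get)
```

Even in this toy model the index wins only when few rows match: random page fetches are priced four times higher than sequential ones, so a predicate that matches half the table is cheaper to answer by reading everything in order.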
Graphical EXPLAIN
pgAdmin (www.pgadmin.org)
Partial indexes
Conditional indexes
Problem: Async job/queue table, find failed jobs
Create index on failed_at column
99% of index is never used
Solution:
CREATE INDEX idx_dj_only_failed ON delayed_jobs (failed_at)
  WHERE failed_at IS NOT NULL;
smaller index
faster updates
Function Indexes
Problem: Suffix search
SELECT … WHERE code LIKE '%123'
“Solution”:
Add reverse_code column, populate, add triggers for updates,
create index on reverse_code column,
reverse queries WHERE reverse_code LIKE '321%'
PostgreSQL solution:
CREATE INDEX idx_reversed ON projects
  (reverse((code)::text) text_pattern_ops);
SELECT … WHERE reverse(code) LIKE
  reverse('%123')
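Why reversing helps: an ordered index can answer a prefix search with one range scan, and a suffix of code is a prefix of reverse(code). The sketch below imitates that with a sorted Python list and bisect standing in for the B-tree; the function names and sample codes are hypothetical.

```python
import bisect


def build_reverse_index(codes):
    """Sorted (reversed_code, code) pairs, standing in for the function index."""
    return sorted((code[::-1], code) for code in codes)


def suffix_search(index, suffix):
    """Find all codes ending in suffix using a prefix range scan."""
    prefix = suffix[::-1]  # reverse('%123') becomes '321%'
    # (prefix,) sorts before every (prefix + anything, code) pair,
    # so this is the first position where a match can start.
    start = bisect.bisect_left(index, (prefix,))
    matches = []
    for reversed_code, code in index[start:]:
        if not reversed_code.startswith(prefix):
            break  # sorted order: no later entry can match
        matches.append(code)
    return matches


codes = ["A-123", "B-999", "123", "X123", "1234"]
index = build_reverse_index(codes)
found = suffix_search(index, "123")
```

The scan touches only the matching entries plus one more, which is the same reason the text_pattern_ops index can serve the LIKE '321%' condition.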
K-nearest search
Problem: Fuzzy string matching 900K rows
Solution: Ngram/Trigram search
johno = {"  j"," jo","hno","joh","no ","ohn"}
CREATE INDEX idx_trgm_name ON subjects USING gist (name gist_trgm_ops);
SELECT name, name <-> 'Michl Brla' AS dist
  FROM subjects ORDER BY dist ASC LIMIT 10; (312ms)
"Michal Barla" ; 0.588235
"Michal Bula" ; 0.647059
"Michal Broz" ; 0.647059
"Pavel Michl" ; 0.647059
"Michal Brna" ; 0.647059
Views
Constraints propagated down to views
CREATE VIEW edges AS
SELECT subject_id AS source_id,
connected_subject_id AS target_id FROM raw_connections
UNION ALL
SELECT connected_subject_id AS source_id,
subject_id AS target_id FROM raw_connections;
SELECT * FROM edges WHERE source_id = 123;
SELECT * FROM edges WHERE source_id < 500 ORDER BY source_id LIMIT 10;
No materialization, 2x indexed select + 1x append/merge
Recursive Queries
Problem: Find paths between two nodes in graph
WITH RECURSIVE search_graph(source,target,distance,path) AS (
SELECT source_id, target_id, 1,
ARRAY[source_id, target_id]
FROM edges WHERE source_id = 552506
UNION ALL
SELECT sg.source, e.target_id, sg.distance + 1,
path || ARRAY[e.target_id]
FROM search_graph sg
JOIN edges e ON sg.target = e.source_id
WHERE NOT e.target_id = ANY(path) AND distance < 4
)
SELECT * FROM search_graph WHERE target = 530556 LIMIT 100;
Recursive queries
Graph with ~1M edges (61ms)
source; target; distance; path
530556; 552506; 2; {530556,185423,552506}
JUDr. Robert Kaliňák -> FoodRest s.r.o. -> Ing. Ján Počiatek
530556; 552506; 2; {530556,183291,552506}
JUDr. Robert Kaliňák -> FoRest s.r.o. -> Ing. Ján Počiatek
530556; 552506; 4; {530556,183291,552522,185423,552506}
JUDr. Robert Kaliňák -> FoodRest s.r.o. -> Lena Sisková -> FoRest s.r.o. -> Ing. Ján Počiatek
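How the recursive query is evaluated can be mirrored in a few lines of Python: start from the non-recursive term, then keep joining the newest rows against the edges until no new rows appear. This is a sketch of the evaluation strategy on a small hypothetical graph, not of PostgreSQL internals; the node ids and the function name are made up.

```python
def search_graph(raw_connections, source, max_distance=4):
    """Return (source, target, distance, path) rows like the recursive CTE."""
    # The edges view: every raw connection counts in both directions.
    edges = {}
    for a, b in raw_connections:
        edges.setdefault(a, []).append(b)
        edges.setdefault(b, []).append(a)

    # Non-recursive term: direct neighbours of the source.
    working = [(source, t, 1, [source, t]) for t in edges.get(source, [])]
    rows = []
    while working:
        rows.extend(working)
        next_working = []
        for src, target, distance, path in working:
            if distance >= max_distance:  # WHERE ... distance < 4
                continue
            for nxt in edges.get(target, []):
                if nxt not in path:  # WHERE NOT e.target_id = ANY(path)
                    next_working.append((src, nxt, distance + 1, path + [nxt]))
        working = next_working
    return rows


# Hypothetical graph: two short routes and two long routes from 1 to 5.
connections = [(1, 2), (2, 5), (1, 3), (3, 5), (3, 4), (4, 2)]
rows = search_graph(connections, source=1)
paths_to_5 = sorted(path for _, target, _, path in rows if target == 5)
```

The path array does double duty: it is the result the user wants and the cycle guard that keeps the recursion finite.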
Window functions
“Aggregate functions without grouping”
avg, count, sum, rank, row_number, ntile…
Problem: Find closest nodes to a given node
Order by sum of path scores
Path score = 0.9^<distance> / log(1 + <number of paths>)
SELECT source, target FROM (
SELECT source, target, path, distance,
0.9 ^ distance / log(1 +
COUNT(*) OVER (PARTITION BY distance, target)
) AS score
FROM ( … ) AS paths
) AS scored_paths
GROUP BY source, target ORDER BY SUM(score) DESC
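The scoring query can be followed step by step in Python: first count the rows in each (distance, target) partition without collapsing them, then score every path, then group and sum. This is an illustrative sketch on made-up path rows; the function name and data are assumptions, and log is taken as base 10 because that is what PostgreSQL's one-argument log() computes.

```python
import math
from collections import Counter, defaultdict


def rank_targets(paths):
    """paths: iterable of (source, target, distance) rows.

    Returns [((source, target), total_score), ...], best score first.
    """
    paths = list(paths)
    # COUNT(*) OVER (PARTITION BY distance, target): every row keeps its
    # identity and just learns how large its partition is.
    partition_size = Counter((distance, target) for _, target, distance in paths)

    totals = defaultdict(float)
    for source, target, distance in paths:
        n = partition_size[(distance, target)]
        score = 0.9 ** distance / math.log10(1 + n)
        totals[(source, target)] += score  # SUM(score) GROUP BY source, target
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


# Hypothetical rows: two paths of length 2 and one of length 4 reach node 5,
# a single path of length 2 reaches node 4.
ranking = rank_targets([(1, 5, 2), (1, 5, 2), (1, 5, 4), (1, 4, 2)])
```

The Counter plays the role of the window function: unlike GROUP BY, it adds a per-partition count to each path row while leaving the rows themselves in place for the later sum.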
Window functions
Example: Closest to Róbert Kaliňák
"Bussines Park Bratislava a.s."
"JARABINY a.s."
"Ing. Robert Pintér"
"Ing. Ján Počiatek"
"Bratislava trade center a.s."
…
1M edges, 41ms
Additional resources
www.postgresql.org
Read the docs, seriously
www.explainextended.com
SQL guru blog
explain.depesz.com
First aid for slow queries
www.wikivs.com/wiki/MySQL_vs_PostgreSQL
MySQL vs. PostgreSQL comparison