indexing complex postgresql data types

86
Indexing Complex PostgreSQL Data Types Jonathan S. Katz PGDay UK 2014 July 9, 2014

Upload: jkatz05

Post on 10-May-2015

7.885 views

Category:

Technology


4 download

DESCRIPTION

PostgreSQL comes built-in with a variety of indexes, some of which are further extensible to build powerful new indexing schemes. But what are all these index types? What are some of the special features of these indexes? What are the size & performance tradeoffs? How do I know which ones are appropriate for my application? Fortunately, this talk aims to answer all of these questions as we explore the whole family of PostgreSQL indexes: B-tree, expression, GiST (of all flavors), GIN and how they are used in theory and practice.

TRANSCRIPT

Page 1: Indexing Complex PostgreSQL Data Types

Indexing  Complex  PostgreSQL  Data  Types

Jonathan  S.  Katz  -­‐  PGDay  UK  2014  -­‐  July  9,  2014

Page 2: Indexing Complex PostgreSQL Data Types

About

• Jonathan  S.  Katz  – CTO,  VenueBook  – Co-­‐Organizer,  NYC  PostgreSQL  User  Group  – PGConf  NYC  2015  • Mar  25  -­‐  27,  2015  

• New  York  Marriott  Downtown  

• http://nyc.pgconf.us  

–@jkatz05

2

Page 3: Indexing Complex PostgreSQL Data Types

Quick  Overview

• Introductory  Talk  with  demos  and  fun  

• B-­‐trees  

• GiST:  Generalized  Search  Trees  

• GIN:  Generalized  Inverted  Index  

• SP-­‐GiST:  Space  Partitioned  Generalized  Search  Trees

3

Page 4: Indexing Complex PostgreSQL Data Types

Assumptions

• PostgreSQL  9.3+  

• most  will  be  9.0+  

• PostGIS  2.0+  

• Believe  it  will  work  for  most  available  versions  

• PostgreSQL  9.4  beta?

4

Page 5: Indexing Complex PostgreSQL Data Types

The  Beginning

5

Page 6: Indexing Complex PostgreSQL Data Types

The  Beginning

6

4

3

8

7

2

9

1

6

10

6

Page 7: Indexing Complex PostgreSQL Data Types

Find  3

7

4

3

8

7

2

9

1

6

10

5

Page 8: Indexing Complex PostgreSQL Data Types

Find  3

8

4

3

8

7

2

9

1

6

10

5

Page 9: Indexing Complex PostgreSQL Data Types

Find  3

9

4

3

8

7

2

9

1

6

10

5

Page 10: Indexing Complex PostgreSQL Data Types

Find  10

10

4

3

8

7

2

9

1

6

10

5

Page 11: Indexing Complex PostgreSQL Data Types

Find  10

11

4

3

8

7

2

9

1

6

10

5

Page 12: Indexing Complex PostgreSQL Data Types

Find  10

12

4

3

8

7

2

9

1

6

10

5

Page 13: Indexing Complex PostgreSQL Data Types

Find  10

13

4

3

8

7

2

9

1

6

10

5

Page 14: Indexing Complex PostgreSQL Data Types

Find  10

14

4

3

8

7

2

9

1

6

10

5

Page 15: Indexing Complex PostgreSQL Data Types

Find  10

15

4

3

8

7

2

9

1

6

10

5

Page 16: Indexing Complex PostgreSQL Data Types

Find  10

16

4

3

8

7

2

9

1

6

10

5

Page 17: Indexing Complex PostgreSQL Data Types

Find  10

17

4

3

8

7

2

9

1

6

10

5

Page 18: Indexing Complex PostgreSQL Data Types

Find  10

18

4

3

8

7

2

9

1

6

10

5

Page 19: Indexing Complex PostgreSQL Data Types

Find  10

19

4

3

8

7

2

9

1

6

10

5

Page 20: Indexing Complex PostgreSQL Data Types

What  We  Learned

• Without  any  data  structure  around  search,  we  rely  on  "hope"  

• Assumed  "unique  values"  and  "equality"  – would  have  to  scan  all  rows  otherwise  

• …and  what  about:  – INSERT  – UPDATE  – DELETE

20

Page 21: Indexing Complex PostgreSQL Data Types

What  We  Need

• Need  a  data  structure  for  search  that:  – allows  efficient  lookups  – plays  nicely  with  disk  I/O  – does  not  take  too  long  for  updates  

21

Page 22: Indexing Complex PostgreSQL Data Types

B-­‐Trees

• "default"  index  • quick  traversal  to  leaf  nodes  • leaf  nodes  pre-­‐sorted  • node  size  designed  to  fit  in  disk  block  size  – "degree"  of  nodes  has  max-­‐size  

• theoretical  performance  – reads:  O(log  n)  – writes:  O(log  n)  – space:  O(n)

22

Page 23: Indexing Complex PostgreSQL Data Types

B-­‐Trees  and  PostgreSQL

• supports  – <=,  <,  =,  >,  >=  

– BETWEEN,  IN  

– IS  NOT  NULL,  IS  NULL  

– LIKE  in  specific  case  of  ‘plaintext%’  

– ~  in  specific  case  of  ‘^plaintext’  

– ILIKE  and  ~*  if  pattern  starts  with  nonalpha  characters  

• does  not  support  

• IS  NOT  DISTINCT  FROM

23

Page 24: Indexing Complex PostgreSQL Data Types

B-­‐Trees  and  PostgreSQL

• data  types  supported  – any  data  type  with  all  the  equality  operators  defined  – number  types  

• integer,  numeric,  decimal  

– text  • char,  varchar,  text  

– date  /  times  • timestamptz,  timestamp,  date,  time,  timetz,  interval  

- arrays,  ranges

24

Page 25: Indexing Complex PostgreSQL Data Types

Demo  Specs  +  Configuration

• Hardware  specs  – 2.3GHz  Intel  i7  2x4  core  – 16GB  RAM  DDR3  1600MHz  – Apple  SSD  6Gbps  

• postgresql.conf  (9.3,  9.4  beta  1)  – shared_buffers  =  1GB  – work_mem  =  64MB  – maintenance_work_mem  =  1024MB  – effective_cache_size  =  8GB

25

Page 26: Indexing Complex PostgreSQL Data Types

Demo  #1:  Basic  Indexing  Plans

26

Page 27: Indexing Complex PostgreSQL Data Types

Demo  #1  Notes

• Index  maintenance  • VACUUM  –  "cleans  up"  after  writes  on  table  /indexes  

– ANALYZE  –  keeps  statistics  up-­‐to-­‐date  for  planner  !VACUUM ANALYZE tablename;  !

• Good  idea  to  leave  autovacuum  on

27

Page 28: Indexing Complex PostgreSQL Data Types

Indexing  in  Production

• CREATE  INDEX  CONCURRENTLY  • REINDEX  – corruption,  bloat,  invalid  

• FILLFACTOR  – 10  –  100  – default:  90  – strategy:  lower  %  ::  write-­‐activity  

• TABLESPACE  • NULLS  LAST,  NULLS  FIRST

28

Page 29: Indexing Complex PostgreSQL Data Types

Demo  #2:  Partial  Indexes

29

CREATE INDEX indexname ON tablename (columnname) WHERE somecondition;

Page 30: Indexing Complex PostgreSQL Data Types

Demo  #2  Notes

• Partial  Indexes  are  – good  if  known  to  query  limited  subset  of  table  

– take  up  less  space  – allow  for  much  quicker  writes  

• Like  all  good  things,  do  not  overuse  and  saturate  your  I/O

30

Page 31: Indexing Complex PostgreSQL Data Types

Unique  Indexes

• only  for  B-­‐trees  • NULL  not  unique  • use  UNIQUE  constraints  –  automatically  create  indexes  

!CREATE TABLE foo (bar int UNIQUE);  -- or  CREATE UNIQUE INDEX foo_bar_idx ON foo (bar);  ALTER TABLE foo ADD CONSTRAINT a_unique USING INDEX a_idx;

31

Page 32: Indexing Complex PostgreSQL Data Types

Multi-­‐Column  Indexes

• Useful  for  – querying  two  columns  that  are  frequently  queried  together  

– enforcing  UNIQUEness  across  columns  • n.b.  creating  UNIQUE  constraint  on  table  creates  UNIQUE  INDEX  

• PostgreSQL  supports  – up  to  32  columns  – B-­‐tree,  GiST,  GIN  

• Be  careful  of  how  you  choose  initial  column  order!

32

Page 33: Indexing Complex PostgreSQL Data Types

Multi-­‐Column  Indexes

33

CREATE INDEX multicolumn_idx ON tablename (col1, col2);

!!!CREATE UNIQUE INDEX pn_idx ON phone_numbers (country_code, national_number) WHERE extension IS NULL

Page 34: Indexing Complex PostgreSQL Data Types

Demo  #3  Notes

• Multi-­‐column  indexes  can  be  – efficient  for  speed  +  space  – inefficient  with  performance  

• Usage  depends  on  your  application  needs

34

Page 35: Indexing Complex PostgreSQL Data Types

Expression  Indexes

• can  index  on  expressions  to  speed  up  lookups  – e.g.  case  insensitive  email  addresses  

– can  use  functions  or  scalars  • (x * y) / 100  • COALESCE(first_name, '') || ' ' || COALESCE(last_name, '')  

• LOWER(email_address)  

• tradeoff:  slower  writes

35

Page 36: Indexing Complex PostgreSQL Data Types

Demo  #4:  Expression  Indexes

36

Page 37: Indexing Complex PostgreSQL Data Types

Demo  #4  Notes

• fast  lookups,  slow  updates  

• size  on  disk  

• cool  data  types  – geometric  

– JSON

37

Page 38: Indexing Complex PostgreSQL Data Types

Geometric  Data  Types

38

Page 39: Indexing Complex PostgreSQL Data Types

Geometric  Data  Types

CREATE TABLE points (coord point);

!CREATE INDEX points_idx ON points (coord); ERROR: data type point has no default operator class for access method "btree"

HINT: You must specify an operator class for the index or define a default operator class for the data type.

39

Page 40: Indexing Complex PostgreSQL Data Types

GiST

• "generalized  search  tree"  • infrastructure  that  provides  template  to  create  arbitrary  indexing  schemes  – supports  concurrency,  logging,  searching  –  only  have  to  define  behavior  

– user-­‐defined  operator  class  • <<,  &<,  &>,  >>,  <<|,  &<|,  |&>,  |>>,  @>,  <@,  ~=,  &&  

– have  to  implement  functions  in  interface  • supports  lossless  +  lossy  indexes  • provides  support  for  "nearest-­‐neighbor"  queries  –  "KNN-­‐Gist"

40

CREATE INDEX points_coord_gist_idx ON points USING gist(coord)

Page 41: Indexing Complex PostgreSQL Data Types

Demo  #5:  Geometric  GiST  +  KNN-­‐Gist

41

Page 42: Indexing Complex PostgreSQL Data Types

Demo  #5  Notes

• GiST  indexes  on  geometric  types  radically  speedup  reads  

• Writes  are  slower  due  to  distance  calculation  

• Index  size  can  be  very  big

42

Page 43: Indexing Complex PostgreSQL Data Types

PostGIS

• For  when  you  are  doing  real  things  with  shapes

43• (and  geographic  information  systems)

Page 44: Indexing Complex PostgreSQL Data Types

PostGIS  +  Indexes

• B-­‐Tree?  

• R-­‐Tree?  

• PostGIS  docs  do  not  recommend  using  just  an  R-­‐Tree  index  

• GiST  

• overlaps!    containment!  

• uses  a  combination  of  GiST  +  R-­‐Tree

44

Page 45: Indexing Complex PostgreSQL Data Types

PostGIS  +  GiST

45

2-­‐D

CREATE INDEX zipcodes_geom_gist_idx ON zipcodes USING gist(geom);

N-­‐D  (PostGIS  2.0+

CREATE INDEX zipcodes_geom_gist_idx ON zipcodes USING gist(geom gist_geometry_ops_nd);

Page 46: Indexing Complex PostgreSQL Data Types

Example  -­‐  USA  Zipcode  Boundaries

• 33,120  rows  

• geom:  MultiPolygon  

• 52MB  without  indexes  

• With  geometry  GiST  +  integer  B-­‐Tree:  869MB

46

Page 47: Indexing Complex PostgreSQL Data Types

What  Zipcode  Is  My  Office  In?

• Geocoded  Address  

• Lat,Long  =  40.7356197,-­‐73.9891102  

• PostGIS:  POINT(-­‐73.9891102  40.7356197)  

• 4269  -­‐  “SRID”  -­‐  unique  ID  for  coordinate  system  definitions

47

SELECT zcta5ce10 AS zipcode FROM zipcodes WHERE ST_Contains( geom, --MultiPolygon ST_GeomFromText('POINT(-73.9891102 40.7356197)', 4269) );

Page 48: Indexing Complex PostgreSQL Data Types

What  Zipcode  Is  My  Office  In?

• No  Index

48

Seq Scan on zipcodes (cost=0.00..15382.00 rows=1 width=6) (actual time=64.780..5153.485 rows=1 loops=1) Filter: ((geom && '0101000020AD100000F648DE944D7F52C08CE54CC9285E4440'::geometry) AND _st_contains(geom, '0101000020AD100000F648DE944D7F52C08CE54CC9285E4440'::geometry)) Rows Removed by Filter: 33119 Total runtime: 5153.505 ms

Page 49: Indexing Complex PostgreSQL Data Types

What  Zipcode  Is  My  Office  In?

• Here’s  the  GiST:

49

Index Scan using zipcodes_geom_gist on zipcodes (cost=0.28..8.54 rows=1 width=6) (actual time=0.120..0.207 rows=1 loops=1) Index Cond: (geom && '0101000020AD100000F648DE944D7F52C08CE54CC9285E4440'::geometry) Filter: _st_contains(geom, '0101000020AD100000F648DE944D7F52C08CE54CC9285E4440'::geometry) Rows Removed by Filter: 1 ! Total runtime: 0.235 ms

Page 50: Indexing Complex PostgreSQL Data Types

PostGIS  +  GiST  Conclusion

• 5153ms  =>  0.235ms  =  21,900x  speedup  :-­‐)

50

SELECT zcta5ce10 AS zipcode FROM zipcodes WHERE ST_Contains( geom, ST_GeomFromText('POINT(-73.9891102 40.7356197)', 4269) ); ! zipcode --------- 10003

Page 51: Indexing Complex PostgreSQL Data Types

Full  Text  Search

• PostgreSQL  offers  full  text  search  with  the  tsearch2  engine  – algorithms  for  performing  FTS  

– to_tsvector('english',  content)  @@  to_tsquery('irish  &  conference  |  meeting')  

– provides  indexing  capabilities  for  efficient  search

51

Page 52: Indexing Complex PostgreSQL Data Types

Test  Data  Set

• Wikipedia  English  category  titles  –  all  1,823,644  that  I  downloaded

52

Page 53: Indexing Complex PostgreSQL Data Types

Full-­‐Text  Search:  Basics

53

SELECT title FROM category WHERE to_tsvector('english', title) @@ to_tsquery('united & kingdom’); ! title ----- Lists of railway stations in the United Kingdom Political history of the United Kingdom Military of the United Kingdom United Kingdom constitution Television channels in the United Kingdom United Kingdom Roman Catholic secondary schools in the United Kingdom [results truncated] !!QUERY PLAN ------------ Seq Scan on category (cost=0.00..49262.77 rows=46 width=29) (actual time=21.900..16809.890 rows=8810 loops=1) Filter: (to_tsvector('english'::regconfig, title) @@ to_tsquery('united & kingdom'::text)) Rows Removed by Filter: 1814834 !Total runtime: 16811.108 ms

Page 54: Indexing Complex PostgreSQL Data Types

Full-­‐Text  Search  +  GiST

54

CREATE INDEX category_title_gist_idx ON category USING gist(to_tsvector('english', title)); !SELECT title FROM category WHERE to_tsvector('english', title) @@ to_tsquery('united & kingdom'); QUERY PLAN ------------- Bitmap Heap Scan on category (cost=4.77..182.47 rows=46 width=29) (actual time=75.517..180.650 rows=8810 loops=1) Recheck Cond: (to_tsvector('english'::regconfig, title) @@ to_tsquery('united & kingdom'::text)) -> Bitmap Index Scan on category_title_gist_idx (cost=0.00..4.76 rows=46 width=0) (actual time=74.687..74.687 rows=8810 loops=1) Index Cond: (to_tsvector('english'::regconfig, title) @@ to_tsquery('united & kingdom’::text)) !Total runtime: 181.354 ms

Page 55: Indexing Complex PostgreSQL Data Types

Full  Text  Search  +  GiST

• GiST  indexes  can  produce  false  positives  – "documents"  represented  by  fixed  length  signature  • words  are  hashed  into  single  bits  and  concatenated  

– when  false  positive  occurs,  row  is  returned  and  checked  to  see  if  false  match  

• Extra  validations  =  performance  degradation

55

Page 56: Indexing Complex PostgreSQL Data Types

Performance  Summary  with  GiST

• initial  index  build  takes  awhile  =>  slow  writes  

• reads  are  quick  

• Table  size:  271MB  

• Index  size:  83MB

56

Page 57: Indexing Complex PostgreSQL Data Types

GIN  Index

• "generalized  inverted  index"  

• supports  searching  within  composite  data  – arrays,  full-­‐text  documents,  hstore  

• key  is  stored  once  and  points  to  composites  it  is  contained  in  

• like  GiST,  provides  index  infrastructure  to  extend  GIN  based  on  behavior  – supports  operators  <@,  @>,  =,  &&  

• GIN  performance  ⬄  log(#  unique  things)57

Page 58: Indexing Complex PostgreSQL Data Types

Full  Text  Search  +  GIN

58

CREATE INDEX category_title_gin_idx ON category USING gin(to_tsvector('english', title)); !EXPLAIN ANALYZE SELECT title FROM category WHERE to_tsvector('english', title) @@ to_tsquery('united & kingdom'); !QUERY PLAN ------- Bitmap Heap Scan on category (cost=28.36..206.06 rows=46 width=29) (actual time=8.864..14.674 rows=8810 loops=1) Recheck Cond: (to_tsvector('english'::regconfig, title) @@ to_tsquery('united & kingdom'::text)) -> Bitmap Index Scan on category_title_gin_idx (cost=0.00..28.35 rows=46 width=0) (actual time=7.905..7.905 rows=8810 loops=1) Index Cond: (to_tsvector('english'::regconfig, title) @@ to_tsquery('united & kingdom'::text))

!!Total runtime: 15.157 ms

Page 59: Indexing Complex PostgreSQL Data Types

Performance  Summary  with  GIN

• index  build  was  much  quicker  

• significant  speedup  from  no  index  – (12,000ms  =>  15ms)  

• significant  speedup  from  GiST  – (181ms  =>  15ms)  

• Table  size:  271MB  

• Index  size:  

• 9.3:  71MB  

• 9.4  beta  1:    40MB

59

Page 60: Indexing Complex PostgreSQL Data Types

What  Was  Not  Discussed

• Word  density  – prior  to  9.3,  performance  issues  with  greater  word  density  

• Type  of  text  data  –  phrases  vs  paragraphs

60

Page 61: Indexing Complex PostgreSQL Data Types

Full  Text  Search  –  GiST  vs  GIN

• Reads  – overall,  GIN  should  win  

• Writes  – traditionally,  GiST  has  better  performance  for  writes  

– GIN  • FASTUPDATE  • 9.4:  compression

61

Page 62: Indexing Complex PostgreSQL Data Types

Regular  Expression  Indexes

• Added  in  9.3  

• Support  for  LIKE/ILIKE  wildcard  indexes  in  9.1  – title  LIKE  '%ab%e'  

• Uses  pg_trgm  extension  +  GIN  

!CREATE EXTENSION IF NOT EXISTS pg_trgm; CREATE INDEX category_title_regex_idx ON category USING GIN(title gin_trgm_ops);

62

Page 63: Indexing Complex PostgreSQL Data Types

Regular  Expressions  -­‐  No  Index

63

EXPLAIN ANALYZE SELECT title FROM category WHERE title ~ '(([iI]sland(s)?)|([pP]eninsula))$'; !QUERY PLAN ---------- Seq Scan on category (cost=0.00..40144.55 rows=182 width=29) (actual time=2.509..4260.792 rows=5878 loops=1) Filter: (title ~ '(([iI]sland(s)?)|([pP]eninsula))$'::text) Rows Removed by Filter: 1817766 !Total runtime: 4261.204 ms

Page 64: Indexing Complex PostgreSQL Data Types

Regular  Expressions  -­‐  Indexed

64

CREATE INDEX category_title_regex_idx ON category USING gin(title gin_trgm_ops); !EXPLAIN ANALYZE SELECT title FROM category WHERE title ~ '(([iI]sland(s)?)|([pP]eninsula))$'; QUERY PLAN ----------- Bitmap Heap Scan on category (cost=197.41..871.77 rows=182 width=29) (actual time=107.445..146.713 rows=5878 loops=1) Recheck Cond: (title ~ '(([iI]sland(s)?)|([pP]eninsula))$'::text) Rows Removed by Index Recheck: 4712 -> Bitmap Index Scan on category_title_regex_idx (cost=0.00..197.37 rows=182 width=0) (actual time=106.645..106.645 rows=10590 loops=1) Index Cond: (title ~ '(([iI]sland(s)?)|([pP]eninsula))$'::text)

!!Total runtime: 147.026 ms

Page 65: Indexing Complex PostgreSQL Data Types

Range  Types

• stores  range  data  –  1  to  6  – 2013-­‐10-­‐29  –  2013-­‐11-­‐2  

• easy-­‐to-­‐use  operators  to  check  inclusion,  overlaps  

• built-­‐in  types:  integers,  numerics,  dates,  timestamps  

• extensible65

Page 66: Indexing Complex PostgreSQL Data Types

Range  Type  Examples

66

--find all ranges that overlap with [100, 200) !SELECT * FROM ranges WHERE int4range(100, 200) && range; !range ----------- [10,102) [13,102) [18,108) [32,101) [34,134) [37,123) [43,111) [46,132) [48,107) [results trunctated] !QUERY PLAN ----------- Seq Scan on ranges (cost=0.00..14239.86 rows=7073 width=32) (actual time=0.018..185.411 rows=143 loops=1) Filter: ('[100,200)'::int4range && range) Rows Removed by Filter: 999857 ! Total runtime: 185.439 ms

Page 67: Indexing Complex PostgreSQL Data Types

Range  Types  +  GiST

67

CREATE INDEX ranges_range_gist_idx ON ranges USING gist(range); !EXPLAIN ANALYZE SELECT * FROM ranges WHERE int4range(100, 200) && range; !QUERY PLAN ------------ Bitmap Heap Scan on ranges (cost=5.29..463.10 rows=130 width=13) (actual time=0.120..0.135 rows=144 loops=1) Recheck Cond: ('[100,200)'::int4range && range) -> Bitmap Index Scan on ranges_range_gist_idx (cost=0.00..5.26 rows=130 width=0) (actual time=0.109..0.109 rows=144 loops=1) Index Cond: ('[100,200)'::int4range && range) !!Total runtime: 0.168 ms

Page 68: Indexing Complex PostgreSQL Data Types

SP-­‐GiST

• space-­‐partitioned  generalized  search  tree  

• ideal  for  non-­‐balanced  data  structures  – k-­‐d  trees,  quad-­‐trees,  suffix  trees  – divides  search  space  into  partitions  of  unequal  size  • matching  partitioning  rule  =  fast  search  

• traditionally  for  "in-­‐memory"  transactions,  converted  to  play  nicely  with  I/O

68

Page 69: Indexing Complex PostgreSQL Data Types

Range  Types:  GiST  vs  SP-­‐Gist

CREATE TABLE ranges AS SELECT int4range( (random()*5)::int, (random()*5)::int + 5 ) AS range FROM generate_series(1,<N>) x;

!SELECT * FROM ranges WHERE range <operator> int4range(3,6);

69

Page 70: Indexing Complex PostgreSQL Data Types

N  =  1,000,000

70

CREATE INDEX ranges_range_spgist_idx ON ranges USING spgist(range); ERROR: unexpected spgdoinsert() failure Fixed in 9.3.2

GiST  Used GiST  Time SP-Gist Used SP-GiST Time= Yes 121 Yes 37

&& No 257  No 260

@> No 223 No 223<@ Yes 163  Yes 111

<< Yes 95 Yes 5

>> Yes 95 Yes 25

&< No 184 No 185&> No 203 No 203

Page 71: Indexing Complex PostgreSQL Data Types

Range  Types:  GiST  vs  SP-­‐GiST

CREATE TABLE ranges AS SELECT int4range(x, x + (random()*5)::int + 5) AS range

FROM generate_series(1,<N>) x;

71

Page 72: Indexing Complex PostgreSQL Data Types

N  =  250,000

72

GiST  Used GiST  Time SP-­‐GiST  Used SP-­‐GiST  Time= Yes 0.5 Yes 0.7&& Yes 0.3 Yes 0.3@> Yes 0.3 Yes 0.3<@ Yes 0.06 Yes 0.25<< No 40 Yes 0.2>> No 60 No 60&< Yes 0.3 Yes 0.2&> No 74 No 61

Page 73: Indexing Complex PostgreSQL Data Types

GiST  vs  SP-­‐GiST:  Space

73

GiST  Clustered SP-­‐GiST  Clustered GiST  Sparse SP-­‐GiST  Sparse

100K  Size 6MB 5MB 6MB 11MB

100K  Time 0.5s .4s 2.5s 7.8s

250K  Size 15MB 12MB 15MB 28MB

250K  Time 1.5s 1.1s 6.3s 47.2s

500K  Size 30MB 25MB 30MB 55MB500K  Time 3.1s 3.0s 13.9s 192s

1MM  Size 59MB 52MB! 60MB 110MB

1MM  Time 5.1s 5.7s 29.2 777s

Page 74: Indexing Complex PostgreSQL Data Types

Integer  Arrays

74

CREATE UNLOGGED TABLE int_arrays AS SELECT ARRAY[x, x + 1, x + 2] AS data FROM generate_series(1,1000000) x; !CREATE INDEX int_arrays_data_idx ON int_arrays (data); !CREATE INDEX int_arrays_data_gin_idx ON int_arrays USING GIN(data);

Page 75: Indexing Complex PostgreSQL Data Types

B-­‐Tree(?)  +  Integer  Arrays

75

EXPLAIN ANALYZE SELECT * FROM int_arrays WHERE 5432 = ANY (data); QUERY PLAN ----------- Seq Scan on int_arrays (cost=0.00..30834.00 rows=5000 width=33) (actual time=1.260..159.197 rows=3 loops=1) Filter: (5432 = ANY (data)) Rows Removed by Filter: 999997

!Total runtime: 159.222 ms

Page 76: Indexing Complex PostgreSQL Data Types

GIN  +  Integer  Arrays

76

EXPLAIN ANALYZE SELECT * FROM int_arrays WHERE ARRAY[5432] <@ data; QUERY PLAN ----------- Bitmap Heap Scan on int_arrays (cost=70.75..7680.14 rows=5000 width=33) (actual time=0.020..0.021 rows=3 loops=1) Recheck Cond: ('{5432}'::integer[] <@ data) -> Bitmap Index Scan on int_arrays_data_gin_idx (cost=0.00..69.50 rows=5000 width=0) (actual time=0.014..0.014 rows=3 loops=1) Index Cond: ('{5432}'::integer[] <@ data)

!Total runtime: 0.045 ms

Page 77: Indexing Complex PostgreSQL Data Types

Hash  Indexes

• only  work  with  "="  operator  

• are  still  not  WAL  logged  as  of  9.4  beta  1  – not  crash  safe  – not  replicated

77

Page 78: Indexing Complex PostgreSQL Data Types

btree_gin

78

CREATE EXTENSION IF NOT EXISTS btree_gin; !CREATE UNLOGGED TABLE numbers AS SELECT (random() * 2000)::int AS a FROM generate_series(1, 2000000) x; !CREATE INDEX numbers_gin_idx ON numbers USING gin(a); !EXPLAIN ANALYZE SELECT * FROM numbers WHERE a = 1000; !QUERY PLAN ------------ Bitmap Heap Scan on numbers (cost=113.50..9509.26 rows=10000 width=4) (actual time=0.388..1.459 rows=991 loops=1) Recheck Cond: (a = 1000) -> Bitmap Index Scan on numbers_gin_idx (cost=0.00..111.00 rows=10000 width=0) (actual time=0.232..0.232 rows=991 loops=1) Index Cond: (a = 1000)

!Total runtime: 1.563 ms

Page 79: Indexing Complex PostgreSQL Data Types

btree_gin  vs  btree

79

-- btree SELECT pg_size_pretty(pg_total_relation_size('numbers_idx')); pg_size_pretty ---------------- 43 MB !!!-- GIN SELECT pg_size_pretty(pg_total_relation_size('numbers_gin_idx')); pg_size_pretty ---------------- 16 MB

• Only  use  GIN  over  btree  if  you  have  a  lot  of  duplicate  entries

Page 80: Indexing Complex PostgreSQL Data Types

hstore  -­‐  the  PostgreSQL  Key-­‐Value  Store

80

CREATE EXTENSION IF NOT EXISTS hstore; !CREATE UNLOGGED TABLE keypairs AS SELECT (x || ' => ' || (x + (random() * 5)::int))::hstore AS data FROM generate_series(1,1000000) x; SELECT pg_size_pretty(pg_relation_size('keypairs')); !!SELECT * FROM keypairs WHERE data ? ‘3'; data ---------- "3"=>"4" !EXPLAIN ANALYZE SELECT * FROM keypairs WHERE data ? ‘3'; QUERY PLAN ----------- Seq Scan on keypairs (cost=0.00..19135.06 rows=950 width=32) (actual time=0.065..208.808 rows=1 loops=1) Filter: (data ? '3'::text) Rows Removed by Filter: 999999 ! Total runtime: 208.825 ms

Page 81: Indexing Complex PostgreSQL Data Types

hstore  -­‐  the  PostgreSQL  Key-­‐Value  Store

81

CREATE INDEX keypairs_data_gin_idx ON keypairs USING gin(data); !EXPLAIN ANALYZE SELECT * FROM keypairs WHERE data ? ‘3'; !QUERY PLAN ----------- Bitmap Heap Scan on keypairs (cost=27.75..2775.66 rows=1000 width=24) (actual time=0.044..0.045 rows=1 loops=1) Recheck Cond: (data ? '3'::text) -> Bitmap Index Scan on keypairs_data_gin_idx (cost=0.00..27.50 rows=1000 width=0) (actual time=0.039..0.039 rows=1 loops=1) Index Cond: (data ? '3'::text)

!Total runtime: 0.071 ms

Page 82: Indexing Complex PostgreSQL Data Types

JSONB:  Coming  in  9.4

82

INSERT INTO documents SELECT row_to_json(ROW(x, x + 2, x + 3))::jsonb FROM generate_series(1,1000000) x; !!CREATE INDEX documents_data_gin_idx ON documents USING gin(data jsonb_path_ops); !!!SELECT * FROM documents WHERE data @> '{ "f1": 10 }'; data -------------------------------- {"f1": 10, "f2": 12, "f3": 13} !!Execution time: 0.084 ms

Page 83: Indexing Complex PostgreSQL Data Types

Awesome  vs  WTF:  A  Note  On  Operator  Indexability

83

EXPLAIN ANALYZE SELECT * FROM documents WHERE data @> '{ "f1": 10 }'; !QUERY PLAN ----------- Bitmap Heap Scan on documents (cost=27.75..3082.65 rows=1000 width=66) (actual time=0.029..0.030 rows=1 loops=1) Recheck Cond: (data @> '{"f1": 10}'::jsonb) Heap Blocks: exact=1 -> Bitmap Index Scan on documents_data_gin_idx (cost=0.00..27.50 rows=1000 width=0) (actual time=0.014..0.014 rows=1 loops=1) Index Cond: (data @> '{"f1": 10}'::jsonb)

!Execution time: 0.084 ms !EXPLAIN ANALYZE SELECT * FROM documents WHERE '{ "f1": 10 }' <@ data; !QUERY PLAN ----------- Seq Scan on documents (cost=0.00..24846.00 rows=1000 width=66) (actual time=0.015..245.924 rows=1 loops=1) Filter: ('{"f1": 10}'::jsonb <@ data) Rows Removed by Filter: 999999 !Execution time: 245.947 ms

Page 84: Indexing Complex PostgreSQL Data Types

For  More  Information…

• http://www.postgresql.org/docs/current/static/indexes.html  

• http://www.postgresql.org/docs/current/static/gist.html  

• http://www.postgresql.org/docs/current/static/gin.html  

• http://www.postgresql.org/docs/current/static/spgist.html  

• GiST  +  GIN  +  Full  Text  Search:  – http://www.postgresql.org/docs/current/static/textsearch-­‐indexes.html

84

Page 85: Indexing Complex PostgreSQL Data Types

Conclusion

• Postgres  has  *a  lot*  of  different  types  of  indexes,  and  variations  on  each  of  its  engines  

• Extensions  make  use  of  PostgreSQL  indexes  – PostGIS  

• Need  to  understand  where  index  usage  is  appropriate  in  your  application

85

Page 86: Indexing Complex PostgreSQL Data Types

Thanks!

• http://nyc.pgconf.us  

• Twitter:  @jkatz05

86