Tales from production with PostgreSQL at scale FossAsia 2016 / PgDay Asia 2016 @ Singapore Presented by Sivakumar Soumya Ranjan Subudhi [email protected] [email protected] 19 March 2016


Page 1: Tales from production with postgreSQL at scale

Tales from production with PostgreSQL at scale

FossAsia 2016 / PgDay Asia 2016 @ Singapore
Presented by Sivakumar and Soumya Ranjan Subudhi
[email protected] [email protected]
19 March 2016

Page 2: Tales from production with postgreSQL at scale

World's largest independent mobile ad network
2.2 trillion ad requests per year
1 billion unique users in our network
720 billion total ads served

Page 3: Tales from production with postgreSQL at scale

Database @ InMobi

OLTP OLAP

Page 4: Tales from production with postgreSQL at scale

Database @ InMobi

Average 1.5 billion transactions per day across the clusters
Average 18-22k QPS, with a peak of 58k QPS
5-min average write duration < 8 ms
5-min average select duration < 90 ms
Warehouse size of 14 TB
Streaming replication across 6 DCs around the world (including AWS), with WAL files generated on the order of 5 per second

Page 5: Tales from production with postgreSQL at scale
Page 6: Tales from production with postgreSQL at scale

Today’s Agenda

User connections
Idle transactions
Replication issues
Temporary file limit
Out-of-memory issues
Partitions
Tablespaces on master and slave
SSH tunneling
Miscellaneous

Page 7: Tales from production with postgreSQL at scale

User Connections

[Diagram: clients C1-C5 making direct, concurrent connections to the database]

Page 8: Tales from production with postgreSQL at scale

User Connections

Increasing max_connections to a higher number

Page 9: Tales from production with postgreSQL at scale

Increased Connections ?

More RAM usage
Processes compete for resources
Throughput falls
Latency suffers

FATAL: too many connections for role "readuser"

Page 10: Tales from production with postgreSQL at scale

[Diagram: Clients / Applications -> Connection pool (pgbouncer) -> Database]

• Online restart/upgrade without stopping client connections
• Online reconfiguration of most settings
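A minimal pgbouncer setup along these lines can be sketched as follows; the `appdb` alias, file paths, and pool sizes are illustrative, not from the talk:

```ini
; pgbouncer.ini sketch (values are assumptions; tune for your workload)
[databases]
appdb = host=127.0.0.1 port=5432 dbname=appdb

[pgbouncer]
listen_addr = 127.0.0.1
listen_port = 6432
auth_type = md5
auth_file = /etc/pgbouncer/userlist.txt
pool_mode = transaction
max_client_conn = 1000
default_pool_size = 20
```

With `pool_mode = transaction`, a server connection is returned to the pool at each transaction end, which is what lets a small `default_pool_size` serve a large `max_client_conn`.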

Page 11: Tales from production with postgreSQL at scale

User Connections

If not using DB-side pooling:
Enable client application pooling (Java, Hibernate, ...)
Avoid hung connections
Keep applications in the same colo as the database
Good network bandwidth between hosts
Give each component (application) a separate user
Improve performance by allocating more resources: increasing RAM and CPU, use of SSDs

Page 12: Tales from production with postgreSQL at scale

Idle in transactions

Why idle in transactions ?

# ps -ef | grep postgres | grep idle

Idle in transaction in slony

postgres: user db 127.0.0.1(55658) idle in transaction

Page 13: Tales from production with postgreSQL at scale

Idle in transactions

Alerting on idle-in-transaction sessions
Add an auto-kill job (be careful)

select * from pg_stat_activity where state = 'idle in transaction';
select pg_terminate_backend(pid);

Avoid using
# kill -9 <pid of process>
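An auto-kill job of the kind mentioned above can be sketched in SQL; the 10-minute cutoff is an assumption, not a number from the talk, and the `pid`/`state` columns exist on PostgreSQL 9.2+ (older releases use `procpid` and `current_query`):

```sql
-- Terminate sessions that have been idle in transaction for too long.
-- The cutoff is illustrative; tune it for your workload before automating.
SELECT pg_terminate_backend(pid)
FROM pg_stat_activity
WHERE state = 'idle in transaction'
  AND state_change < now() - interval '10 minutes';
```

Run from cron via psql, this is the "careful" part: anything holding a transaction open that long is killed, including legitimate batch jobs.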

Page 14: Tales from production with postgreSQL at scale

Long running queries

Same queries running multiple times, for more than 1 hour

Page 15: Tales from production with postgreSQL at scale

Long running queries …

Explain Analyze on the query
Execution plan and cost of plan
Missing indexes
Partition pruning
Statement timeout

statement_timeout = 3600000 (1 hour, in milliseconds)
Checking if we are bottlenecked on RAM, CPU
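Besides setting it globally in postgresql.conf, the same timeout can be applied per session or per role; the role name below is hypothetical:

```sql
-- Cap any single statement at 1 hour for the current session
SET statement_timeout = '1h';

-- Or persist it for one role only (reporting_user is an illustrative name)
ALTER ROLE reporting_user SET statement_timeout = '1h';
```

Per-role settings are handy when only ad-hoc reporting users need the cap and the OLTP application should keep its own, tighter limit.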

Page 16: Tales from production with postgreSQL at scale

Temporary file limit issue

Temporary file limit hit due to bad joins in a query. How is work_mem related?

SELECT temp_files "Number of temporary files", temp_bytes "Size of temporary files" FROM pg_stat_database psd;

[Diagram: a 2 MB operation with work_mem = 1 MB spills to a temporary file on disk]

Page 17: Tales from production with postgreSQL at scale

Temporary file limit issue ...

temp_file_limit = -1 (default): no limit

Limits per-session usage of temporary files for sorts, hashes, and similar operations.

Can be set to 20 GB or 10% of the available disk space, whichever is less.
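Applying the rule of thumb above might look like this (the 20 GB figure is the talk's suggestion; `ALTER SYSTEM` requires PostgreSQL 9.4+, on older releases edit postgresql.conf instead):

```sql
-- Cap per-session temporary file usage; takes effect after a reload
ALTER SYSTEM SET temp_file_limit = '20GB';
SELECT pg_reload_conf();
```

A query that exceeds the cap is cancelled with an error rather than filling the disk, which is usually the better failure mode.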

Page 18: Tales from production with postgreSQL at scale

OOM Error

ERROR: out of memory
DETAIL: Failed on request of size

[Diagram: Postgres calls malloc(), the kernel responds with NULL because the OS-level memory limit was hit]

Page 19: Tales from production with postgreSQL at scale

OOM Error …

Changes in configs:
kernel.shmmax
kernel.shmall
shared_buffers

Rechecking the queries
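The kernel side of those changes is a sysctl fragment; the values below assume a 64 GB host and are purely illustrative:

```
# /etc/sysctl.conf sketch (sizing is an assumption for a 64 GB host)
kernel.shmmax = 17179869184   # 16 GB max shared segment; must exceed shared_buffers plus overhead
kernel.shmall = 4194304       # total shared memory in 4 KB pages: 16 GB / 4096
```

Apply with `sysctl -p`. Note that PostgreSQL 9.3+ uses mmap for most shared memory, so these limits matter chiefly on the older releases this talk predates upgrading from.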

Page 20: Tales from production with postgreSQL at scale

Replication related issues

Page 21: Tales from production with postgreSQL at scale

FATAL: requested WAL segment 00000002000032A80000002B has already been removed

Calculate the number of WAL files created (each 16 MB in size)
Calculate network speed
Check disk space available on the master
Set wal_keep_segments accordingly
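The sizing can be sketched with the numbers given earlier in the talk (roughly 5 WAL files per second, 16 MB each); the 10-minute outage window is an assumption:

```shell
# Back-of-envelope sizing for wal_keep_segments
files_per_sec=5     # WAL generation rate from the talk
segment_mb=16       # standard WAL segment size
outage_secs=600     # tolerate a 10-minute slave outage (assumption)

segments=$((files_per_sec * outage_secs))
disk_gb=$((segments * segment_mb / 1024))
echo "wal_keep_segments = $segments (~${disk_gb} GB of pg_xlog retained)"
# prints: wal_keep_segments = 3000 (~46 GB of pg_xlog retained)
```

The trade-off is exactly the slide's checklist: enough retained segments to ride out slave lag, but not so many that pg_xlog eats the master's disk.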

Page 22: Tales from production with postgreSQL at scale

FATAL: could not send data to WAL stream: server closed the connection unexpectedly

Usually a transient issue
Check for problems with the NIC or ToR switch

Page 23: Tales from production with postgreSQL at scale

xlog filling the disk due to failure of archive_command

Running out of space in pg_xlog
Loss of recovery-related benefits
Slave getting out of sync
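A failing archive_command is what keeps xlog pinned on disk, so it should only report success when the copy truly succeeded. A sketch in the style of the PostgreSQL documentation's example (the archive path is illustrative):

```
# postgresql.conf fragment; /mnt/server/archivedir is an assumed mount point
archive_mode = on
archive_command = 'test ! -f /mnt/server/archivedir/%f && cp %p /mnt/server/archivedir/%f'
```

The `test ! -f` guard makes the command fail rather than silently overwrite an existing archive file; PostgreSQL retries failed segments and, as the slide warns, keeps them in pg_xlog until archiving succeeds.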

Page 24: Tales from production with postgreSQL at scale

Few other issues with replication …

PANIC: WAL contains references to invalid pages
FATAL: could not open file "pg_xlog/00000006.history"
FATAL: hot standby is not possible because max_connections = 100 is a lower setting than on the master server (its value was 500)
FATAL: base backup could not send data, aborting backup

Page 25: Tales from production with postgreSQL at scale

Partitions

Page 26: Tales from production with postgreSQL at scale

PostgreSQL partitions

Need for it
Rule based
A partition key
Adding constraints
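The rule-based, inheritance-style partitioning of that era (pre-declarative partitioning) can be sketched as follows; the `events` table and date range are hypothetical:

```sql
-- Parent table plus one monthly child with a CHECK constraint on the
-- partition key, and a rule redirecting inserts (names are illustrative)
CREATE TABLE events (event_time timestamptz, payload text);

CREATE TABLE events_2016_03 (
    CHECK (event_time >= '2016-03-01' AND event_time < '2016-04-01')
) INHERITS (events);

CREATE RULE events_insert_2016_03 AS
    ON INSERT TO events
    WHERE (event_time >= '2016-03-01' AND event_time < '2016-04-01')
    DO INSTEAD INSERT INTO events_2016_03 VALUES (NEW.*);
```

The CHECK constraints are what let the planner prune partitions (with `constraint_exclusion` enabled), tying back to the partition-pruning item under long running queries.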

Page 27: Tales from production with postgreSQL at scale

Inserting data into partitions

INSERT <oid> <count>
INSERT 0 123
INSERT 0 0 (the rule redirected the rows into a child partition, so the parent reports zero rows inserted)

Page 28: Tales from production with postgreSQL at scale

too many partitions and max_locks_per_transaction issue

max_locks_per_transaction = 64 (default)
Check on locks
Look at query plans
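A query touching a parent with many partitions locks every child, which is how the default of 64 gets exhausted. A diagnostic sketch for "check on locks":

```sql
-- How many locks each backend currently holds; a backend scanning a
-- heavily partitioned table will show one entry per child relation
SELECT pid, count(*) AS locks_held
FROM pg_locks
GROUP BY pid
ORDER BY locks_held DESC;
```

If the counts approach max_locks_per_transaction * max_connections, either raise the setting (requires a restart) or prune partitions harder in the query plans.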

Page 29: Tales from production with postgreSQL at scale

Tables frequently updated

autovacuum_enabled = true,
autovacuum_vacuum_threshold = 50000,
autovacuum_analyze_threshold = 50000,
autovacuum_vacuum_scale_factor = 0.1,
autovacuum_analyze_scale_factor = 0.2
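These per-table storage parameters are applied with `ALTER TABLE`; the table name below is hypothetical, the values are the slide's:

```sql
-- Per-table autovacuum tuning for a frequently updated table
ALTER TABLE hot_table SET (
    autovacuum_enabled = true,
    autovacuum_vacuum_threshold = 50000,
    autovacuum_analyze_threshold = 50000,
    autovacuum_vacuum_scale_factor = 0.1,
    autovacuum_analyze_scale_factor = 0.2
);
```

Raising the thresholds while keeping modest scale factors stops autovacuum from firing on every small burst of updates without letting bloat run away on a large table.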

Page 30: Tales from production with postgreSQL at scale

Tablespace creation on master and slave

Addition of more disks
Tablespace creation on master and slaves
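The catch with streaming replication is that `CREATE TABLESPACE` is replayed on the slaves too, so the directory must already exist there. A sketch, with an illustrative path and names:

```sql
-- The directory must exist, be empty, and be owned by the postgres OS
-- user on the master AND on every streaming slave BEFORE this runs,
-- or WAL replay on the slave will fail. Path and names are illustrative.
CREATE TABLESPACE fastspace LOCATION '/mnt/ssd1/pg_tblspc';

CREATE TABLE big_events (id bigint) TABLESPACE fastspace;
```

Preparing the directory on all slaves first is the whole point of the slide's "on master and slaves" wording.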

Page 31: Tales from production with postgreSQL at scale

Reading blocks and pages

Data corrupted
Index corrupted
Recreate indexes

ERROR: could not read block xxx of relation base/xxx/xxx: I/O error

ERROR: could not read block xxx in file "base/xxx/xxx"

PANIC: _bt_restore_page: cannot add item to page

Page 32: Tales from production with postgreSQL at scale

Cache Lookup

Cache lookup failure for an index during pg_dump
Data corrupted

Page 33: Tales from production with postgreSQL at scale

Secure TCP/IP Connections with SSH Tunnels

ssh -L 3333:foo.com:5432 [email protected]
ssh -C -L 3333:foo.com:5432 [email protected]
psql -h localhost -p 3333 postgres
pg_basebackup -D /data-dir/ -p 3333 -U replicationuser -h localhost -v

Page 34: Tales from production with postgreSQL at scale

Socket connection issue

A forced umount -f followed by remounting the disks caused all socket connections to fail

Page 35: Tales from production with postgreSQL at scale