PostgreSQL Debugging#
When PostgreSQL breaks, it usually falls into a handful of patterns. This is a reference for diagnosing each one with specific queries and commands.
Connection Refused#
Work through these in order:
1. Is PostgreSQL running?
sudo systemctl status postgresql-16
2. Is it listening on the right address?
ss -tlnp | grep 5432
If it shows 127.0.0.1:5432 but you need remote access, set listen_addresses = '*' in postgresql.conf and restart the server (a config sketch follows this list).
3. Does pg_hba.conf allow the connection? Check logs for no pg_hba.conf entry for host:
sudo tail -50 /var/log/postgresql/postgresql-16-main.log
4. Firewall? Check iptables -L -n | grep 5432 or firewall-cmd --list-ports.
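For steps 2 and 3, a minimal sketch of the two config changes, assuming clients on a 10.0.0.0/24 subnet and scram-sha-256 auth (both are placeholders to adapt):
# postgresql.conf -- changing listen_addresses requires a server restart
listen_addresses = '*'

# pg_hba.conf -- format: TYPE  DATABASE  USER  ADDRESS  METHOD
host  all  all  10.0.0.0/24  scram-sha-256
pg_hba.conf changes only need a reload: SELECT pg_reload_conf(); or sudo systemctl reload postgresql-16.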
Too Many Connections#
Check current connections:
SELECT count(*), state, usename FROM pg_stat_activity GROUP BY state, usename ORDER BY count DESC;
Find and kill stale idle connections:
SELECT pid, usename, state, now() - state_change AS idle_duration
FROM pg_stat_activity WHERE state = 'idle' ORDER BY idle_duration DESC;
SELECT pg_terminate_backend(pid) FROM pg_stat_activity
WHERE state = 'idle' AND state_change < now() - interval '10 minutes' AND pid <> pg_backend_pid();
Prevent it: Set idle_in_transaction_session_timeout = 60000 (milliseconds, so 60 seconds) in postgresql.conf and use PgBouncer for connection pooling; a minimal PgBouncer sketch follows.
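As a rough sketch, assuming the application is pointed at port 6432 and a database named appdb (names, pool size, and auth settings are placeholders, not a tuned config):
; pgbouncer.ini
[databases]
appdb = host=127.0.0.1 port=5432 dbname=appdb

[pgbouncer]
listen_addr = 127.0.0.1
listen_port = 6432
auth_type = scram-sha-256
auth_file = /etc/pgbouncer/userlist.txt
pool_mode = transaction      ; server connections are reused between transactions
max_client_conn = 1000       ; clients PgBouncer will accept
default_pool_size = 20       ; real PostgreSQL connections per database/user pair
The application then connects to port 6432; with transaction pooling, the number of real server connections stays at default_pool_size no matter how many clients connect.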
Slow Queries#
Find currently running slow queries:
SELECT pid, now() - query_start AS duration, state, query
FROM pg_stat_activity WHERE state = 'active' AND query_start < now() - interval '5 seconds'
ORDER BY duration DESC;
Cancel or terminate:
SELECT pg_cancel_backend(12345); -- cancels the query
SELECT pg_terminate_backend(12345); -- kills the connection
Find historically slow queries with pg_stat_statements (the extension must be in shared_preload_libraries and created with CREATE EXTENSION pg_stat_statements):
SELECT calls, round(mean_exec_time::numeric, 2) AS mean_ms,
round(total_exec_time::numeric, 2) AS total_ms, substr(query, 1, 120) AS query
FROM pg_stat_statements ORDER BY mean_exec_time DESC LIMIT 10;
Then use EXPLAIN (ANALYZE, BUFFERS) on the offending query. Look for sequential scans on large tables and high “Rows Removed by Filter” counts.
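A sketch of that workflow; the table, column, and index names below are placeholders:
-- ANALYZE actually executes the query, so be careful with statements that write.
EXPLAIN (ANALYZE, BUFFERS)
SELECT *
FROM orders
WHERE customer_id = 42
  AND created_at > now() - interval '30 days';

-- If the plan shows "Seq Scan on orders" plus a large "Rows Removed by Filter",
-- the usual fix is an index matching the WHERE clause:
CREATE INDEX CONCURRENTLY idx_orders_customer_created ON orders (customer_id, created_at);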
Lock Contention#
Queries hang without returning. Find blocked and blocking queries:
SELECT blocked.pid AS blocked_pid, blocked.query AS blocked_query,
blocking.pid AS blocking_pid, blocking.query AS blocking_query, blocking.state
FROM pg_stat_activity blocked
JOIN pg_locks bl ON bl.pid = blocked.pid AND NOT bl.granted
JOIN pg_locks gl ON gl.locktype = bl.locktype
AND gl.database IS NOT DISTINCT FROM bl.database
AND gl.relation IS NOT DISTINCT FROM bl.relation
AND gl.pid <> bl.pid AND gl.granted
JOIN pg_stat_activity blocking ON blocking.pid = gl.pid;
Common causes: ALTER TABLE waiting for long queries to finish, or an idle in transaction session holding a lock. Fix by terminating the blocker: SELECT pg_terminate_backend(<blocking_pid>);
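On PostgreSQL 9.6 and later, pg_blocking_pids() is a shorter way to get the same picture:
-- Every waiting backend together with the PIDs currently blocking it.
SELECT pid,
       pg_blocking_pids(pid) AS blocked_by,
       now() - query_start AS wait_duration,
       left(query, 80) AS query
FROM pg_stat_activity
WHERE cardinality(pg_blocking_pids(pid)) > 0;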
For DDL in production, always set a lock timeout:
SET lock_timeout = '5s';
ALTER TABLE orders ADD COLUMN new_col text;
Disk Space Issues#
Find largest tables:
SELECT schemaname || '.' || relname AS table_name,
pg_size_pretty(pg_total_relation_size(relid)) AS total_size
FROM pg_stat_user_tables ORDER BY pg_total_relation_size(relid) DESC LIMIT 10;
Check WAL accumulation:
du -sh /var/lib/postgresql/16/main/pg_wal/
Excessive WAL usually means inactive replication slots or a failing archive_command. Check pg_replication_slots and drop unused slots.
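A sketch of the slot check; the slot name passed to the drop function is a placeholder:
-- Inactive slots pin WAL on disk; the diff against the current LSN shows how much.
SELECT slot_name, slot_type, active,
       pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS retained_wal
FROM pg_replication_slots
ORDER BY pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn) DESC;

-- Drop a slot that is no longer needed:
SELECT pg_drop_replication_slot('old_standby_slot');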
OOM Kills#
journalctl -k | grep -i "oom\|killed process"
Common causes: shared_buffers set too high, work_mem set too high with many concurrent queries, or simply too many connections. work_mem applies per sort or hash operation, not per connection, so peak usage can be many multiples of it under load. Fix by reducing these values and using PgBouncer.
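As an illustration only, rough starting points for a dedicated server with around 16 GB of RAM (not tuned values):
# postgresql.conf -- illustrative sizing for ~16 GB RAM, adjust to the workload
shared_buffers = 4GB       # commonly ~25% of RAM
work_mem = 32MB            # per sort/hash operation, multiplied under concurrency
max_connections = 100      # keep modest and pool with PgBouncer instead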
Corrupted Indexes#
Symptoms: wrong query results, or different results with SET enable_indexscan = off.
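One way to confirm the suspicion before rebuilding (the query and index names are placeholders; compare the two results):
-- Run the suspect query normally, then with index scans disabled, and compare.
SELECT count(*) FROM orders WHERE customer_id = 42;

SET enable_indexscan = off;
SET enable_bitmapscan = off;
SELECT count(*) FROM orders WHERE customer_id = 42;
RESET enable_indexscan;
RESET enable_bitmapscan;
-- A mismatch points at the index; rebuild it as below.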
REINDEX INDEX idx_orders_customer; -- blocks writes to the table until it finishes
REINDEX INDEX CONCURRENTLY idx_orders_customer; -- PostgreSQL 12+, avoids long locks
Transaction ID Wraparound#
PostgreSQL uses 32-bit transaction IDs. Without vacuuming, the server stops accepting commands at roughly 2 billion transactions of age to prevent wraparound data loss.
Check proximity:
SELECT datname, age(datfrozenxid) AS xid_age,
round(age(datfrozenxid)::numeric / 2000000000 * 100, 1) AS pct_to_wraparound
FROM pg_database ORDER BY age(datfrozenxid) DESC;
Above 50% is a problem. Above 80% is an emergency.
Find the blocking table and fix it:
SELECT c.oid::regclass AS table_name, age(c.relfrozenxid) AS xid_age
FROM pg_class c WHERE c.relkind = 'r' ORDER BY age(c.relfrozenxid) DESC LIMIT 5;
VACUUM FREEZE VERBOSE <table_name>;
Stuck VACUUM#
A VACUUM running for hours is usually blocked by a long-running transaction that prevents dead tuple cleanup.
SELECT pid, xact_start, now() - xact_start AS duration, state, query
FROM pg_stat_activity WHERE xact_start IS NOT NULL ORDER BY xact_start ASC LIMIT 5;
The oldest transaction sets the xmin horizon. If it is an idle in transaction from hours ago, terminate it. Then VACUUM will make progress.
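A hedged cleanup query; the one-hour threshold is an arbitrary example:
-- Terminate idle-in-transaction sessions whose transaction started over an hour ago.
SELECT pg_terminate_backend(pid)
FROM pg_stat_activity
WHERE state = 'idle in transaction'
  AND xact_start < now() - interval '1 hour'
  AND pid <> pg_backend_pid();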