PostgreSQL Debugging#

When PostgreSQL breaks, it usually falls into a handful of patterns. This is a reference for diagnosing each one with specific queries and commands.

Connection Refused#

Work through these in order:

1. Is PostgreSQL running?

sudo systemctl status postgresql-16

2. Is it listening on the right address?

ss -tlnp | grep 5432

If it shows 127.0.0.1:5432 but you need remote access, set listen_addresses = '*' in postgresql.conf and restart; a reload is not enough for this setting.

3. Does pg_hba.conf allow the connection? Check the logs for no pg_hba.conf entry for host (a sample entry follows this list):

sudo tail -50 /var/log/postgresql/postgresql-16-main.log

4. Firewall? Check iptables -L -n | grep 5432 or firewall-cmd --list-ports.
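
If step 3 is the culprit, add a matching pg_hba.conf entry. A hypothetical entry, assuming an application database and role both named app and clients in 10.0.0.0/16:

host    app    app    10.0.0.0/16    scram-sha-256

Reload with SELECT pg_reload_conf(); for pg_hba.conf changes to take effect.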

Too Many Connections#

Check current connections:

SELECT count(*), state, usename FROM pg_stat_activity GROUP BY state, usename ORDER BY count DESC;

Find and kill stale idle connections:

SELECT pid, usename, state, now() - state_change AS idle_duration
FROM pg_stat_activity WHERE state = 'idle' ORDER BY idle_duration DESC;

SELECT pg_terminate_backend(pid) FROM pg_stat_activity
WHERE state = 'idle' AND state_change < now() - interval '10 minutes' AND pid <> pg_backend_pid();

Prevent it: set idle_in_transaction_session_timeout = 60000 (one minute) in postgresql.conf so sessions stuck idle inside a transaction get dropped, add idle_session_timeout on PostgreSQL 14+ for plain idle sessions, and use PgBouncer for connection pooling.
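
To apply the timeout without hand-editing the file, ALTER SYSTEM writes it to postgresql.auto.conf ('60s' is the same value as 60000 ms):

ALTER SYSTEM SET idle_in_transaction_session_timeout = '60s';
SELECT pg_reload_conf();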

Slow Queries#

Find currently running slow queries:

SELECT pid, now() - query_start AS duration, state, query
FROM pg_stat_activity WHERE state = 'active' AND query_start < now() - interval '5 seconds'
ORDER BY duration DESC;

Cancel or terminate:

SELECT pg_cancel_backend(12345);     -- cancels the query
SELECT pg_terminate_backend(12345);  -- kills the connection

Find historically slow queries with pg_stat_statements:

SELECT calls, round(mean_exec_time::numeric, 2) AS mean_ms,
  round(total_exec_time::numeric, 2) AS total_ms, substr(query, 1, 120) AS query
FROM pg_stat_statements ORDER BY mean_exec_time DESC LIMIT 10;

Then use EXPLAIN (ANALYZE, BUFFERS) on the offending query. Look for sequential scans on large tables and high “Rows Removed by Filter” counts.
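
A minimal sketch against the orders table used in later examples (the filter column is illustrative):

EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM orders WHERE customer_id = 42;

A Seq Scan on a large table with millions of rows removed by the filter usually points to a missing index on the filter column.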

Lock Contention#

Queries hang without returning. Find blocked and blocking queries:

SELECT blocked.pid AS blocked_pid, blocked.query AS blocked_query,
  blocking.pid AS blocking_pid, blocking.query AS blocking_query, blocking.state
FROM pg_stat_activity blocked
JOIN pg_stat_activity blocking
  ON blocking.pid = ANY(pg_blocking_pids(blocked.pid))
WHERE blocked.wait_event_type = 'Lock';

Common causes: an ALTER TABLE queued behind long-running queries, or an idle-in-transaction session holding a lock. Fix by terminating the blocker: SELECT pg_terminate_backend(<blocking_pid>);

For DDL in production, always set a lock timeout so a blocked ALTER TABLE fails fast instead of queuing and blocking every query behind it:

SET lock_timeout = '5s';
ALTER TABLE orders ADD COLUMN new_col text;

Disk Space Issues#

Find largest tables:

SELECT schemaname || '.' || relname AS table_name,
  pg_size_pretty(pg_total_relation_size(relid)) AS total_size
FROM pg_stat_user_tables ORDER BY pg_total_relation_size(relid) DESC LIMIT 10;

Check WAL accumulation:

du -sh /var/lib/postgresql/16/main/pg_wal/

Excessive WAL usually means an inactive replication slot or a failing archive_command. Check pg_replication_slots and drop unused slots.
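
To see how much WAL each slot is pinning, and to drop a dead one (the slot name here is illustrative):

SELECT slot_name, active,
  pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS retained_wal
FROM pg_replication_slots;

SELECT pg_drop_replication_slot('old_standby_slot');  -- only if nothing will reconnect to it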

OOM Kills#

journalctl -k | grep -i "oom\|killed process"

Common causes: shared_buffers set too high, work_mem set too high (it is allocated per sort or hash operation, per query, so it multiplies under concurrency), or too many connections. Fix by reducing these values and capping connections with PgBouncer.
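
One quick way to check the current values:

SELECT name, setting, unit FROM pg_settings
WHERE name IN ('shared_buffers', 'work_mem', 'maintenance_work_mem', 'max_connections');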

Corrupted Indexes#

Symptoms: wrong query results, or different results with SET enable_indexscan = off.
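
Before rebuilding, the amcheck extension (shipped in contrib; may need the contrib package installed) can confirm the suspicion. The index name is reused from the rebuild example below:

CREATE EXTENSION IF NOT EXISTS amcheck;
SELECT bt_index_check('idx_orders_customer');  -- raises an error if the B-tree is corrupt

Then rebuild: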

REINDEX INDEX idx_orders_customer;                  -- blocks writes (and reads that use the index)
REINDEX INDEX CONCURRENTLY idx_orders_customer;     -- PostgreSQL 12+, doesn't block reads or writes

Transaction ID Wraparound#

PostgreSQL uses 32-bit transaction IDs. If old rows are never frozen by VACUUM, the server stops accepting new transactions as the oldest unfrozen ID approaches ~2 billion, to prevent data loss.

Check proximity:

SELECT datname, age(datfrozenxid) AS xid_age,
  round(age(datfrozenxid)::numeric / 2000000000 * 100, 1) AS pct_to_wraparound
FROM pg_database ORDER BY age(datfrozenxid) DESC;

Above 50% is a problem. Above 80% is an emergency.

Find the blocking table and fix it:

SELECT relname, age(relfrozenxid) AS xid_age FROM pg_class
WHERE relkind = 'r' ORDER BY age(relfrozenxid) DESC LIMIT 5;
VACUUM FREEZE VERBOSE <table_name>;
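
A VACUUM FREEZE on a big table can run for hours; its progress is visible in pg_stat_progress_vacuum:

SELECT pid, relid::regclass AS table_name, phase,
  heap_blks_scanned, heap_blks_total
FROM pg_stat_progress_vacuum;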

Stuck VACUUM#

A VACUUM running for hours is usually blocked by a long-running transaction that prevents dead tuple cleanup.

SELECT pid, xact_start, now() - xact_start AS duration, state, query
FROM pg_stat_activity WHERE xact_start IS NOT NULL ORDER BY xact_start ASC LIMIT 5;

The oldest transaction sets the xmin horizon. If it is a session sitting idle in transaction from hours ago, terminate it; VACUUM will then make progress.
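
For example, to terminate anything that has been idle in a transaction for more than an hour (the threshold is arbitrary):

SELECT pg_terminate_backend(pid)
FROM pg_stat_activity
WHERE state = 'idle in transaction'
  AND xact_start < now() - interval '1 hour'
  AND pid <> pg_backend_pid();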