Database Performance Investigation Runbook#
When a database is slow, resist the urge to immediately tune configuration parameters. Follow this sequence: identify what is slow, understand why, then fix the specific bottleneck. Most performance problems are caused by missing indexes or a single bad query, not global configuration issues.
Phase 1 – Identify Slow Queries#
The first step is always finding which queries are consuming the most time.
PostgreSQL: pg_stat_statements#
Enable the extension if not already loaded:
-- Check if loaded
SELECT * FROM pg_available_extensions WHERE name = 'pg_stat_statements';
-- Enable (requires adding to shared_preload_libraries and restart)
ALTER SYSTEM SET shared_preload_libraries = 'pg_stat_statements';
-- Restart PostgreSQL, then:
CREATE EXTENSION pg_stat_statements;
Find the top queries by total time:
-- Column names below are for PostgreSQL 13+; on 12 and earlier use total_time / mean_time
SELECT
substring(query, 1, 80) AS query_preview,
calls,
round(total_exec_time::numeric, 2) AS total_ms,
round(mean_exec_time::numeric, 2) AS avg_ms,
round((100 * total_exec_time / sum(total_exec_time) OVER ())::numeric, 2) AS pct,
rows
FROM pg_stat_statements
ORDER BY total_exec_time DESC
LIMIT 10;
This gives you the top 10 queries ranked by cumulative execution time. Focus on these – they represent the biggest opportunities.
To reset stats after making changes (so you can measure the impact): SELECT pg_stat_statements_reset();
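The ranking logic in the query above can be sketched in a few lines. The helper below is hypothetical and runs on invented sample rows rather than a live pg_stat_statements view; it computes the same cumulative-time percentage column:

```python
# Sketch: rank query stats by total execution time, as the SQL above does.
# The input rows are invented sample data, not real pg_stat_statements output.

def rank_statements(rows, limit=10):
    """Return the top `limit` rows by total_ms, with a pct-of-total column."""
    grand_total = sum(r["total_ms"] for r in rows)
    ranked = sorted(rows, key=lambda r: r["total_ms"], reverse=True)[:limit]
    return [
        {**r, "pct": round(100 * r["total_ms"] / grand_total, 2)}
        for r in ranked
    ]

stats = [
    {"query": "SELECT * FROM orders WHERE ...", "calls": 50000, "total_ms": 90000.0},
    {"query": "UPDATE customers SET ...",       "calls": 1200,  "total_ms": 8000.0},
    {"query": "SELECT count(*) FROM ...",       "calls": 300,   "total_ms": 2000.0},
]

top = rank_statements(stats, limit=2)
# One query accounts for 90% of all execution time - fix that one first.
```

The pct column is the point: when one query holds most of the total, nothing else in this runbook matters until it is fixed.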
MySQL: Slow Query Log and Performance Schema#
Enable the slow query log dynamically:
SET GLOBAL slow_query_log = 1;
SET GLOBAL long_query_time = 0.5;
SET GLOBAL log_queries_not_using_indexes = 1;
Analyze accumulated slow queries with pt-query-digest:
pt-query-digest /var/log/mysql/slow.log --limit 10
Or query Performance Schema directly for the top queries by total execution time:
SELECT
LEFT(DIGEST_TEXT, 80) AS query_preview,
COUNT_STAR AS calls,
ROUND(SUM_TIMER_WAIT / 1e12, 2) AS total_sec,
ROUND(AVG_TIMER_WAIT / 1e12, 4) AS avg_sec,
SUM_ROWS_EXAMINED AS rows_examined,
SUM_ROWS_SENT AS rows_sent
FROM performance_schema.events_statements_summary_by_digest
ORDER BY SUM_TIMER_WAIT DESC
LIMIT 10;
A large gap between rows_examined and rows_sent signals a missing or ineffective index.
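That examined-vs-sent heuristic is easy to automate. The sketch below (hypothetical `index_suspects` helper, invented sample digests, illustrative 100x threshold) flags digests that examine far more rows than they return:

```python
# Sketch: flag query digests whose examined/sent ratio suggests a missing index.
# Sample digests are invented; real values come from
# performance_schema.events_statements_summary_by_digest.

def index_suspects(digests, ratio_threshold=100):
    """Return (query, ratio) pairs for digests examining far more rows than they send."""
    suspects = []
    for d in digests:
        sent = max(d["rows_sent"], 1)          # avoid division by zero
        ratio = d["rows_examined"] / sent
        if ratio >= ratio_threshold:
            suspects.append((d["query"], round(ratio, 1)))
    return suspects

digests = [
    {"query": "SELECT ... WHERE status = ?", "rows_examined": 5_000_000, "rows_sent": 120},
    {"query": "SELECT ... WHERE id = ?",     "rows_examined": 1,         "rows_sent": 1},
]

suspects = index_suspects(digests)
# The status query scans ~41667 rows per row returned - a classic missing-index shape.
```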
Phase 2 – Analyze Execution Plans#
Once you have the slow queries, examine their execution plans.
PostgreSQL: EXPLAIN (ANALYZE, BUFFERS)#
EXPLAIN (ANALYZE, BUFFERS, FORMAT TEXT)
SELECT o.id, o.total, c.name
FROM orders o
JOIN customers c ON c.id = o.customer_id
WHERE o.status = 'pending'
AND o.created_at > now() - interval '7 days';
What to look for:
- Seq Scan on a large table: Missing index. Check if a WHERE clause column or JOIN column is unindexed.
- Nested Loop with high row count in outer loop: Consider if a Hash Join or Merge Join would be better. The planner may be choosing poorly due to stale statistics – run ANALYZE tablename;.
- Sort with external merge: work_mem is too low for this query, or add an index that matches the sort order.
- Buffers: shared read much larger than shared hit: Data is not in the buffer cache. Either the table is too large for shared_buffers, or this is a cold query path.
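Those red flags are plain substrings in the plan text, so a first-pass triage can be scripted. This is only a string heuristic over EXPLAIN output, with illustrative advice strings and an invented sample plan:

```python
# Sketch: scan EXPLAIN (ANALYZE, BUFFERS) text output for common red flags.
# Purely a substring heuristic; the sample plan below is invented.

RED_FLAGS = {
    "Seq Scan": "sequential scan - check for a missing index",
    "external merge": "sort spilled to disk - raise work_mem or add a matching index",
}

def plan_red_flags(plan_text):
    """Return (pattern, advice) pairs for every red flag found in the plan."""
    return [(p, advice) for p, advice in RED_FLAGS.items() if p in plan_text]

sample_plan = """\
Nested Loop  (cost=0.43..8123.50 rows=10 width=52) (actual rows=9000 loops=1)
  ->  Seq Scan on orders o  (cost=0.00..7000.10 rows=10 width=20)
        Filter: (status = 'pending'::text)
"""

flags = plan_red_flags(sample_plan)
# Finds the Seq Scan; no sort spill in this plan.
```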
-- Check if table statistics are current
SELECT relname, last_analyze, last_autoanalyze, n_live_tup, n_dead_tup
FROM pg_stat_user_tables
WHERE schemaname = 'public'
ORDER BY n_dead_tup DESC;
-- Force statistics refresh
ANALYZE orders;
MySQL: EXPLAIN ANALYZE#
EXPLAIN ANALYZE is available from MySQL 8.0.18 onward; the type and Extra fields described below come from traditional EXPLAIN output.
EXPLAIN ANALYZE
SELECT o.id, o.total, c.name
FROM orders o
JOIN customers c ON c.id = o.customer_id
WHERE o.status = 'pending'
AND o.created_at > DATE_SUB(NOW(), INTERVAL 7 DAY);
What to look for:
- type: ALL: Full table scan. Needs an index on the filter columns.
- type: index: Full index scan (reading every row of the index). Better than ALL but still expensive.
- Extra: Using temporary; Using filesort: Query requires a temporary table and/or sort. Consider adding an index that covers the ORDER BY.
- rows: Estimated rows to examine. Compare against actual rows returned. A large disparity means statistics are stale – run ANALYZE TABLE orders;.
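The estimate-vs-actual comparison works the same way in both databases, so it is worth defining precisely. A sketch, with a hypothetical `estimate_skew` helper and an illustrative ~10x rule of thumb:

```python
# Sketch: quantify how far the planner's `rows` estimate is from reality.
# A skew factor well above ~10x usually means stale statistics (run ANALYZE).

def estimate_skew(estimated_rows, actual_rows):
    """Return the estimate error as a factor >= 1.0 (1.0 = perfect estimate)."""
    est = max(estimated_rows, 1)
    act = max(actual_rows, 1)
    return max(est, act) / min(est, act)

# The planner expected 10 rows; the query actually touched 9000.
skew = estimate_skew(estimated_rows=10, actual_rows=9000)
# A 900x skew: refresh statistics before trusting any plan for this query.
```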
Phase 3 – Check Lock Contention#
Slow queries are not always caused by missing indexes. Lock contention can make fast queries wait.
PostgreSQL: Lock Investigation#
-- Find waiting queries and what they are waiting on
SELECT pid, wait_event_type, wait_event, state, query,
age(now(), query_start) AS duration
FROM pg_stat_activity
WHERE wait_event IS NOT NULL AND state != 'idle'
ORDER BY query_start;
-- Find blocked/blocking pairs (PostgreSQL 9.6+)
SELECT blocked.pid AS blocked_pid, blocked.query AS blocked_query,
blocking.pid AS blocking_pid, blocking.query AS blocking_query
FROM pg_stat_activity blocked
JOIN pg_stat_activity blocking
ON blocking.pid = ANY(pg_blocking_pids(blocked.pid));
Common causes: long-running transactions holding RowExclusiveLock, ALTER TABLE holding AccessExclusiveLock, autovacuum blocking DDL.
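Blocked/blocking pairs often form chains (A waits on B, B waits on C), and only the root of the chain is worth terminating. A sketch of that triage, using an invented wait map shaped like what pg_blocking_pids() returns:

```python
# Sketch: given a map of blocked pid -> list of blocking pids (the shape
# pg_blocking_pids() produces), find root blockers - sessions that block
# others but are not themselves waiting on anyone.

def root_blockers(waits):
    """Return sorted pids that block others without being blocked."""
    blocking = {pid for pids in waits.values() for pid in pids}
    blocked = set(waits)
    return sorted(blocking - blocked)

# pid 101 waits on 202; 202 waits on 303; 303 waits on nothing.
waits = {101: [202], 202: [303]}

roots = root_blockers(waits)
# Commit or terminate 303 first - the whole chain clears behind it.
```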
MySQL: Lock Investigation#
-- Current lock waits
SELECT * FROM sys.innodb_lock_waits\G
-- If sys schema is unavailable (MySQL 5.7; in 8.0 these tables were removed --
-- use performance_schema.data_lock_waits instead)
SELECT
r.trx_id AS waiting_trx,
r.trx_mysql_thread_id AS waiting_pid,
r.trx_query AS waiting_query,
b.trx_id AS blocking_trx,
b.trx_mysql_thread_id AS blocking_pid,
b.trx_query AS blocking_query
FROM information_schema.innodb_lock_waits w
JOIN information_schema.innodb_trx b ON b.trx_id = w.blocking_trx_id
JOIN information_schema.innodb_trx r ON r.trx_id = w.requesting_trx_id;
-- Metadata locks (DDL vs DML contention)
SELECT * FROM performance_schema.metadata_locks
WHERE LOCK_STATUS = 'PENDING';
Common causes: long-running SELECT preventing ALTER TABLE, uncommitted transactions, large UPDATE or DELETE holding row locks.
Phase 4 – Review Connection Pool Saturation#
If the database itself is not slow but applications experience timeouts, the connection pool may be exhausted.
PostgreSQL#
SELECT state, count(*) FROM pg_stat_activity GROUP BY state;
SHOW max_connections;
-- Idle in transaction is the worst offender -- holds locks while doing nothing
SELECT pid, usename, query, age(now(), xact_start) AS txn_duration
FROM pg_stat_activity
WHERE state = 'idle in transaction'
ORDER BY xact_start;
If idle in transaction sessions persist, set idle_in_transaction_session_timeout = '30s'.
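The triage that timeout automates can be sketched directly: filter sessions by state and transaction age. The session rows and 30-second cutoff below are illustrative:

```python
# Sketch: pick 'idle in transaction' sessions whose transaction has outlived
# a cutoff - the same sessions idle_in_transaction_session_timeout would kill.

def stale_idle_txns(sessions, cutoff_seconds=30):
    """Return pids of idle-in-transaction sessions older than the cutoff."""
    return [
        s["pid"]
        for s in sessions
        if s["state"] == "idle in transaction" and s["txn_seconds"] > cutoff_seconds
    ]

sessions = [
    {"pid": 11, "state": "idle in transaction", "txn_seconds": 540},
    {"pid": 12, "state": "active",              "txn_seconds": 2},
    {"pid": 13, "state": "idle in transaction", "txn_seconds": 5},
]

stale = stale_idle_txns(sessions)
# Only pid 11 qualifies: idle in transaction for 9 minutes, holding locks the whole time.
```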
MySQL#
SHOW STATUS LIKE 'Threads_connected';
SHOW STATUS LIKE 'Max_used_connections';
SHOW VARIABLES LIKE 'max_connections';
If Max_used_connections approaches max_connections, the fix is usually a connection pooler (PgBouncer, ProxySQL) rather than raising the limit.
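"Approaches" deserves a concrete definition. A sketch of that check; the 90% threshold is a common rule of thumb, not a standard, and the counter values are invented:

```python
# Sketch: decide whether the connection limit is the bottleneck from the two
# counters above (Max_used_connections vs max_connections).

def pool_saturated(max_used, max_connections, threshold=0.9):
    """True when peak connection usage approaches the configured limit."""
    return max_used / max_connections >= threshold

# 148 of 151 connections in use at peak (~98%): add a pooler, don't raise the limit.
saturated = pool_saturated(148, 151)
```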
Phase 5 – Analyze I/O Patterns#
If queries are well-indexed and there is no lock contention, the bottleneck may be disk I/O.
Operating System Level#
# I/O utilization per device
iostat -xz 2
# Key columns:
# %util: percentage of time device is busy (> 80% means saturation)
# await: average I/O wait time in ms (> 10ms for SSD = problem)
# r/s, w/s: reads and writes per second
PostgreSQL: I/O Statistics#
-- Table I/O: sequential vs index scans
SELECT relname,
seq_scan, seq_tup_read,
idx_scan, idx_tup_fetch,
n_tup_ins, n_tup_upd, n_tup_del
FROM pg_stat_user_tables
ORDER BY seq_tup_read DESC
LIMIT 10;
Tables with high seq_scan and high seq_tup_read are candidates for index creation. Tables with high n_tup_upd and n_tup_del generate dead tuples and need regular vacuuming.
-- Check for tables needing vacuum
SELECT relname, n_dead_tup, last_autovacuum, last_autoanalyze
FROM pg_stat_user_tables
WHERE n_dead_tup > 10000
ORDER BY n_dead_tup DESC;
MySQL: I/O Statistics#
-- Table I/O waits
SELECT object_schema, object_name,
count_read, count_write,
sum_timer_read / 1e12 AS read_sec,
sum_timer_write / 1e12 AS write_sec
FROM performance_schema.table_io_waits_summary_by_table
WHERE object_schema NOT IN ('mysql', 'performance_schema', 'sys')
ORDER BY sum_timer_wait DESC
LIMIT 10;
Phase 6 – Check Buffer and Cache Hit Ratios#
Low cache hit ratios mean the database is reading from disk when it should be reading from memory.
PostgreSQL: Buffer Cache Hit Ratio#
-- Overall hit ratio (should be > 99% for OLTP)
SELECT
round(sum(blks_hit)::numeric / (sum(blks_hit) + sum(blks_read)) * 100, 2) AS hit_ratio
FROM pg_stat_database
WHERE datname = current_database();
If below 99%, shared_buffers is likely too small. Increase it (standard guidance: 25% of total RAM). If a specific table has a low hit ratio, consider whether the working set fits in memory – options are more RAM, partitioning, or archiving old data.
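The hit-ratio arithmetic itself is trivial and identical for both databases, so it is worth having in a reusable form. A sketch over raw counters (the sample values are invented):

```python
# Sketch: the hit-ratio arithmetic from the query above, applied to raw counters.
# blks_hit counts buffer-cache hits; blks_read counts blocks fetched from disk.

def cache_hit_ratio(blks_hit, blks_read):
    """Percentage of block requests served from the buffer cache."""
    total = blks_hit + blks_read
    return round(100 * blks_hit / total, 2) if total else 100.0

ratio = cache_hit_ratio(blks_hit=995_000, blks_read=5_000)
# 99.5% - healthy for OLTP; below 99% suggests the cache is undersized.
```

The same function applies to MySQL's Innodb_buffer_pool_read_requests / Innodb_buffer_pool_reads pair, since it is the same hits-over-total computation.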
MySQL: InnoDB Buffer Pool Hit Ratio#
SHOW STATUS LIKE 'Innodb_buffer_pool_read_requests'; -- logical reads (from cache)
SHOW STATUS LIKE 'Innodb_buffer_pool_reads'; -- physical disk reads
-- hit_ratio = 1 - (reads / read_requests). Should be > 99%.
If below 99%, increase innodb_buffer_pool_size. On MySQL 5.7 and later, this can be done dynamically:
SET GLOBAL innodb_buffer_pool_size = 8 * 1024 * 1024 * 1024; -- 8 GB
Summary: Investigation Order#
- Slow queries (pg_stat_statements / Performance Schema) – find the top offenders.
- Execution plans (EXPLAIN ANALYZE) – understand why they are slow.
- Lock contention (pg_locks / innodb_lock_waits) – rule out waiting as the cause.
- Connection pool (pg_stat_activity / processlist) – ensure connections are not exhausted.
- I/O patterns (iostat + database I/O stats) – check disk saturation.
- Cache hit ratios (buffer cache stats) – verify the working set fits in memory.
Most investigations end at step 1 or 2. A missing index or a bad query plan is the cause in the majority of cases. Steps 3 through 6 matter when the queries themselves are well-optimized but the system is still slow.