Database Performance Investigation Runbook#

When a database is slow, resist the urge to immediately tune configuration parameters. Follow this sequence: identify what is slow, understand why, then fix the specific bottleneck. Most performance problems are caused by missing indexes or a single bad query, not global configuration issues.

Phase 1 – Identify Slow Queries#

The first step is always finding which queries are consuming the most time.

PostgreSQL: pg_stat_statements#

Enable the extension if not already loaded:

-- Check if loaded
SELECT * FROM pg_available_extensions WHERE name = 'pg_stat_statements';

-- Enable (requires adding to shared_preload_libraries and a restart;
-- note that ALTER SYSTEM overwrites any existing shared_preload_libraries entries)
ALTER SYSTEM SET shared_preload_libraries = 'pg_stat_statements';
-- Restart PostgreSQL, then:
CREATE EXTENSION pg_stat_statements;

Find the top queries by total time:

-- Column names are for PostgreSQL 13+; older versions use total_time / mean_time
SELECT
  substring(query, 1, 80) AS query_preview,
  calls,
  round(total_exec_time::numeric, 2) AS total_ms,
  round(mean_exec_time::numeric, 2) AS avg_ms,
  round((100 * total_exec_time / sum(total_exec_time) OVER ())::numeric, 2) AS pct,
  rows
FROM pg_stat_statements
ORDER BY total_exec_time DESC
LIMIT 10;

This gives you the top 10 queries ranked by cumulative execution time. Focus on these – they represent the biggest opportunities.

To reset stats after making changes (so you can measure the impact): SELECT pg_stat_statements_reset();

MySQL: Slow Query Log and Performance Schema#

Enable the slow query log dynamically:

SET GLOBAL slow_query_log = 1;
SET GLOBAL long_query_time = 0.5;
SET GLOBAL log_queries_not_using_indexes = 1;
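
To confirm where the server writes the log before feeding it to pt-query-digest:

SHOW VARIABLES LIKE 'slow_query_log_file';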

Analyze accumulated slow queries with pt-query-digest:

pt-query-digest /var/log/mysql/slow.log --limit 10

Or query Performance Schema directly for the top queries by total execution time:

SELECT
  LEFT(DIGEST_TEXT, 80) AS query_preview,
  COUNT_STAR AS calls,
  ROUND(SUM_TIMER_WAIT / 1e12, 2) AS total_sec,
  ROUND(AVG_TIMER_WAIT / 1e12, 4) AS avg_sec,
  SUM_ROWS_EXAMINED AS rows_examined,
  SUM_ROWS_SENT AS rows_sent
FROM performance_schema.events_statements_summary_by_digest
ORDER BY SUM_TIMER_WAIT DESC
LIMIT 10;

A large gap between rows_examined and rows_sent signals a missing or ineffective index.
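
To surface those queries directly, rank digests by the examined-to-sent ratio. A sketch against the same summary table:

SELECT
  LEFT(DIGEST_TEXT, 80) AS query_preview,
  SUM_ROWS_EXAMINED AS rows_examined,
  SUM_ROWS_SENT AS rows_sent,
  ROUND(SUM_ROWS_EXAMINED / NULLIF(SUM_ROWS_SENT, 0), 1) AS examined_per_sent
FROM performance_schema.events_statements_summary_by_digest
WHERE SUM_ROWS_SENT > 0
ORDER BY examined_per_sent DESC
LIMIT 10;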

Phase 2 – Analyze Execution Plans#

Once you have the slow queries, examine their execution plans.

PostgreSQL: EXPLAIN (ANALYZE, BUFFERS)#

EXPLAIN (ANALYZE, BUFFERS, FORMAT TEXT)
SELECT o.id, o.total, c.name
FROM orders o
JOIN customers c ON c.id = o.customer_id
WHERE o.status = 'pending'
  AND o.created_at > now() - interval '7 days';

What to look for:

  • Seq Scan on a large table: Missing index. Check whether a WHERE clause column or JOIN column is unindexed (see the index sketch after the statistics check below).
  • Nested Loop with high row count in outer loop: Consider whether a Hash Join or Merge Join would be better. The planner may be choosing poorly due to stale statistics – run ANALYZE tablename;.
  • Sort with external merge: work_mem is too low for this query; raise it or add an index that matches the sort order.
  • Buffers: shared read much larger than shared hit: Data is not in the buffer cache. Either the table is too large for shared_buffers, or this is a cold query path.

-- Check if table statistics are current
SELECT relname, last_analyze, last_autoanalyze, n_live_tup, n_dead_tup
FROM pg_stat_user_tables
WHERE schemaname = 'public'
ORDER BY n_dead_tup DESC;

-- Force statistics refresh
ANALYZE orders;
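
For the Seq Scan case above, the usual fix is a composite index matching the query's filter columns. A minimal sketch for the EXPLAIN example (the index name is illustrative):

-- Covers the status equality filter and the created_at range.
-- CONCURRENTLY avoids blocking writes, but cannot run inside a transaction block.
CREATE INDEX CONCURRENTLY idx_orders_status_created
  ON orders (status, created_at);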

MySQL: EXPLAIN ANALYZE#

EXPLAIN ANALYZE
SELECT o.id, o.total, c.name
FROM orders o
JOIN customers c ON c.id = o.customer_id
WHERE o.status = 'pending'
  AND o.created_at > DATE_SUB(NOW(), INTERVAL 7 DAY);

What to look for (the type and Extra columns below come from traditional tabular EXPLAIN; EXPLAIN ANALYZE, available since MySQL 8.0.18, adds actual row counts and timings in tree form):

  • type: ALL: Full table scan. Needs an index on the filter columns.
  • type: index: Full index scan (reading every row of the index). Better than ALL but still expensive.
  • Extra: Using temporary; Using filesort: Query requires a temporary table and/or sort. Consider adding an index that covers the ORDER BY.
  • rows: Estimated rows to examine. Compare against actual rows returned. Large disparity means statistics are stale: ANALYZE TABLE orders;.
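
The MySQL-side fix mirrors the PostgreSQL sketch above (again, the index name is illustrative):

-- Composite index for the status filter and created_at range
ALTER TABLE orders ADD INDEX idx_orders_status_created (status, created_at);
-- Refresh optimizer statistics so the planner sees accurate row estimates
ANALYZE TABLE orders;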

Phase 3 – Check Lock Contention#

Slow queries are not always caused by missing indexes. Lock contention can make fast queries wait.

PostgreSQL: Lock Investigation#

-- Find waiting queries and what they are waiting on
SELECT pid, wait_event_type, wait_event, state, query,
  age(now(), query_start) AS duration
FROM pg_stat_activity
WHERE wait_event IS NOT NULL AND state != 'idle'
ORDER BY query_start;

-- Find blocked/blocking pairs
SELECT blocked.pid AS blocked_pid, blocked.query AS blocked_query,
  blocking.pid AS blocking_pid, blocking.query AS blocking_query
FROM pg_stat_activity blocked
JOIN pg_locks bl ON bl.pid = blocked.pid
JOIN pg_locks kl ON kl.locktype = bl.locktype
  AND kl.relation IS NOT DISTINCT FROM bl.relation
  AND kl.pid != bl.pid
JOIN pg_stat_activity blocking ON blocking.pid = kl.pid
WHERE NOT bl.granted;

Common causes: long-running transactions holding RowExclusiveLock, ALTER TABLE holding AccessExclusiveLock, autovacuum blocking DDL.
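
On PostgreSQL 9.6 and later, pg_blocking_pids() gives a more direct view of the same blocked/blocking pairs:

-- Each waiting backend and the PIDs blocking it
SELECT pid, pg_blocking_pids(pid) AS blocked_by, state, query
FROM pg_stat_activity
WHERE cardinality(pg_blocking_pids(pid)) > 0;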

MySQL: Lock Investigation#

-- Current lock waits
SELECT * FROM sys.innodb_lock_waits\G

-- If sys schema is unavailable (MySQL 5.7; in 8.0 these
-- information_schema tables were replaced by performance_schema.data_lock_waits)
SELECT
  r.trx_id AS waiting_trx,
  r.trx_mysql_thread_id AS waiting_pid,
  r.trx_query AS waiting_query,
  b.trx_id AS blocking_trx,
  b.trx_mysql_thread_id AS blocking_pid,
  b.trx_query AS blocking_query
FROM information_schema.innodb_lock_waits w
JOIN information_schema.innodb_trx b ON b.trx_id = w.blocking_trx_id
JOIN information_schema.innodb_trx r ON r.trx_id = w.requesting_trx_id;

-- Metadata locks (DDL vs DML contention)
SELECT * FROM performance_schema.metadata_locks
WHERE LOCK_STATUS = 'PENDING';

Common causes: long-running SELECT preventing ALTER TABLE, uncommitted transactions, large UPDATE or DELETE holding row locks.
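
Once a blocking session is identified, it can be terminated. This is destructive – it rolls back the blocking transaction – and the thread id below is a placeholder for the blocking_pid from the query above:

-- 12345 is a placeholder; substitute the actual blocking_pid
KILL 12345;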

Phase 4 – Review Connection Pool Saturation#

If the database itself is not slow but applications experience timeouts, the connection pool may be exhausted.

PostgreSQL#

SELECT state, count(*) FROM pg_stat_activity GROUP BY state;
SHOW max_connections;

-- Idle in transaction is the worst offender -- holds locks while doing nothing
SELECT pid, usename, query, age(now(), xact_start) AS txn_duration
FROM pg_stat_activity
WHERE state = 'idle in transaction'
ORDER BY xact_start;

If idle in transaction persists, set idle_in_transaction_session_timeout = '30s'.
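
One way to apply that globally, assuming superuser access:

ALTER SYSTEM SET idle_in_transaction_session_timeout = '30s';
SELECT pg_reload_conf();  -- takes effect on reload, no restart needed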

MySQL#

SHOW STATUS LIKE 'Threads_connected';
SHOW STATUS LIKE 'Max_used_connections';
SHOW VARIABLES LIKE 'max_connections';

If Max_used_connections approaches max_connections, the fix is usually a connection pooler (ProxySQL for MySQL, PgBouncer for PostgreSQL) rather than raising the limit.

Phase 5 – Analyze I/O Patterns#

If queries are well-indexed and there is no lock contention, the bottleneck may be disk I/O.

Operating System Level#

# I/O utilization per device
iostat -xz 2

# Key columns:
# %util: percentage of time the device had requests in flight (> 80% suggests saturation; less meaningful on NVMe and other highly parallel devices)
# await: average I/O wait time in ms (> 10ms for SSD = problem)
# r/s, w/s: reads and writes per second

PostgreSQL: I/O Statistics#

-- Table I/O: sequential vs index scans
SELECT relname,
  seq_scan, seq_tup_read,
  idx_scan, idx_tup_fetch,
  n_tup_ins, n_tup_upd, n_tup_del
FROM pg_stat_user_tables
ORDER BY seq_tup_read DESC
LIMIT 10;

Tables with high seq_scan and high seq_tup_read are candidates for index creation. Tables with high n_tup_upd and n_tup_del generate dead tuples and need regular vacuuming.

-- Check for tables needing vacuum
SELECT relname, n_dead_tup, last_autovacuum, last_autoanalyze
FROM pg_stat_user_tables
WHERE n_dead_tup > 10000
ORDER BY n_dead_tup DESC;
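
For the worst offenders in that list, a manual pass reclaims dead tuples and refreshes statistics in one step (table name from the running example):

VACUUM (ANALYZE, VERBOSE) orders;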

MySQL: I/O Statistics#

-- Table I/O waits
SELECT object_schema, object_name,
  count_read, count_write,
  sum_timer_read / 1e12 AS read_sec,
  sum_timer_write / 1e12 AS write_sec
FROM performance_schema.table_io_waits_summary_by_table
WHERE object_schema NOT IN ('mysql', 'performance_schema', 'sys')
ORDER BY sum_timer_wait DESC
LIMIT 10;

Phase 6 – Check Buffer and Cache Hit Ratios#

Low cache hit ratios mean the database is reading from disk when it should be reading from memory.

PostgreSQL: Buffer Cache Hit Ratio#

-- Overall hit ratio (should be > 99% for OLTP)
SELECT
  round(sum(blks_hit)::numeric / nullif(sum(blks_hit) + sum(blks_read), 0) * 100, 2) AS hit_ratio
FROM pg_stat_database
WHERE datname = current_database();

If below 99%, shared_buffers is likely too small. Increase it (standard guidance: 25% of total RAM). If a specific table has a low hit ratio, consider whether the working set fits in memory – options are more RAM, partitioning, or archiving old data.
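
To see which relations are driving the misses, and to raise shared_buffers (the 4GB value is illustrative; a restart is required):

-- Per-table cache hit percentage, worst disk readers first
SELECT relname,
  heap_blks_read, heap_blks_hit,
  round(100 * heap_blks_hit::numeric
        / nullif(heap_blks_hit + heap_blks_read, 0), 2) AS hit_pct
FROM pg_statio_user_tables
ORDER BY heap_blks_read DESC
LIMIT 10;

-- If the working set should fit but does not:
ALTER SYSTEM SET shared_buffers = '4GB';  -- takes effect after restart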

MySQL: InnoDB Buffer Pool Hit Ratio#

SHOW STATUS LIKE 'Innodb_buffer_pool_read_requests';  -- logical reads (from cache)
SHOW STATUS LIKE 'Innodb_buffer_pool_reads';           -- physical disk reads
-- hit_ratio = 1 - (reads / read_requests). Should be > 99%.
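
On MySQL 5.7.6+ and 8.0 the same counters are queryable from performance_schema.global_status, so the ratio can be computed in one statement (a sketch):

SELECT ROUND(100 - 100 * r.VARIABLE_VALUE / q.VARIABLE_VALUE, 2) AS hit_pct
FROM performance_schema.global_status r
JOIN performance_schema.global_status q
  ON r.VARIABLE_NAME = 'Innodb_buffer_pool_reads'
 AND q.VARIABLE_NAME = 'Innodb_buffer_pool_read_requests';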

If below 99%, increase innodb_buffer_pool_size. On MySQL 5.7 and later this can be resized online; the server rounds the value to a multiple of innodb_buffer_pool_chunk_size * innodb_buffer_pool_instances:

SET GLOBAL innodb_buffer_pool_size = 8 * 1024 * 1024 * 1024;  -- 8 GB

Summary: Investigation Order#

  1. Slow queries (pg_stat_statements / Performance Schema) – find the top offenders.
  2. Execution plans (EXPLAIN ANALYZE) – understand why they are slow.
  3. Lock contention (pg_locks / innodb_lock_waits) – rule out waiting as the cause.
  4. Connection pool (pg_stat_activity / processlist) – ensure connections are not exhausted.
  5. I/O patterns (iostat + database I/O stats) – check disk saturation.
  6. Cache hit ratios (buffer cache stats) – verify the working set fits in memory.

Most investigations end at step 1 or 2. A missing index or a bad query plan is the cause in the majority of cases. Steps 3 through 6 matter when the queries themselves are well-optimized but the system is still slow.