Node Liveness Issues#
Every node must renew its liveness record every 4.5 seconds. Failure to renew marks the node suspect, then dead, triggering re-replication of its ranges.
cockroach node status --insecure --host=localhost:26257

Look at is_live. If a node shows false, check in order:
1. Process crashed. Check cockroach-data/logs/ for fatal or panic entries. OOM kills are the most common cause – check dmesg | grep -i oom on the host.
2. Network partition. The node runs but cannot reach peers. If cockroach node status succeeds locally but fails from other nodes, the problem is network-level (firewalls, security groups, DNS).
3. Disk stall. CockroachDB logs disk stall detected when storage I/O blocks liveness renewal:
SELECT node_id, store_id, available, used, capacity
FROM crdb_internal.kv_store_status;
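You can also check liveness directly from SQL once you can reach any live node. A minimal sketch, assuming the crdb_internal.gossip_liveness table (column names vary slightly between versions):

-- Liveness records as gossiped across the cluster; draining nodes show up here too
SELECT node_id, epoch, expiration, draining
FROM crdb_internal.gossip_liveness;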
Under-Replicated Ranges#
An under-replicated range has fewer replicas than its configured replication factor. This is normal briefly during failures, but persistent under-replication needs attention.
SELECT range_id, start_key, end_key, replicas
FROM crdb_internal.ranges_no_leases
WHERE array_length(replicas, 1) < 3;

Common causes: too few nodes for the replication factor, an incompletely drained decommission, or zone constraints that cannot be satisfied. If a node was lost permanently, re-replication starts after server.time_until_store_dead (default 5 minutes). If under-replication persists, check for stale locality constraints:
SHOW ZONE CONFIGURATIONS;
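It is also worth confirming the store-dead timeout the cluster is actually running with. A minimal sketch; the setting name is standard, the 10-minute value is only an illustration:

SHOW CLUSTER SETTING server.time_until_store_dead;

-- Lengthen the window if nodes routinely return after short maintenance
SET CLUSTER SETTING server.time_until_store_dead = '10m';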
Slow Queries: EXPLAIN ANALYZE#
CockroachDB distributes query execution across nodes. EXPLAIN ANALYZE shows the distributed execution plan with per-node timing.
EXPLAIN ANALYZE SELECT o.id, o.total, c.name
FROM orders o JOIN customers c ON o.customer_id = c.id
WHERE o.created_at > '2026-01-01' AND o.total > 100
ORDER BY o.total DESC LIMIT 50;

Key things to look for:
- Full table scans. spans: FULL SCAN means no index is being used. Add an appropriate index (see the sketch after this list).
- High intermediate row counts. A join reading millions of rows to return 50 means filters are applied too late.
- Network overhead. Disproportionate KV rows read on one node suggests data locality issues.
- Contention time. Non-zero values mean the query waited on locks from other transactions.
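For the example query above, a covering secondary index would eliminate the full scan. A sketch only; the index name and column choice are assumptions based on that query's filters and join key:

-- Matches the filters on created_at and total; STORING avoids an extra lookup for the join column
CREATE INDEX IF NOT EXISTS idx_orders_created_total
    ON orders (created_at, total)
    STORING (customer_id);

Re-run EXPLAIN ANALYZE afterwards to confirm the plan no longer shows spans: FULL SCAN.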
Find the slowest queries cluster-wide:
SELECT fingerprint_id, metadata ->> 'query' AS query,
statistics -> 'statistics' -> 'latencyInfo' ->> 'p99' AS p99_latency,
statistics -> 'statistics' ->> 'cnt' AS exec_count
FROM crdb_internal.statement_statistics
ORDER BY (statistics -> 'statistics' -> 'latencyInfo' ->> 'p99')::FLOAT DESC
LIMIT 20;
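To tie this back to the full-scan discussion above, you can also filter on the statement metadata. A sketch assuming the metadata JSON includes a fullScan flag, as it does in recent versions:

-- Statements that performed full table scans, ordered by how often they ran
SELECT metadata ->> 'query' AS query,
       statistics -> 'statistics' ->> 'cnt' AS exec_count
FROM crdb_internal.statement_statistics
WHERE (metadata ->> 'fullScan')::BOOL
ORDER BY (statistics -> 'statistics' ->> 'cnt')::INT DESC
LIMIT 20;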
Transaction Retry Errors (SQLSTATE 40001)#
Serializable isolation means conflicting transactions produce 40001 errors that must be retried. This is expected behavior, not a bug. Your application must implement retry loops:
import psycopg2
import time

def run_transaction(conn, callback, max_retries=5):
    for attempt in range(max_retries):
        try:
            with conn.cursor() as cur:
                callback(cur)
            conn.commit()
            return
        except psycopg2.errors.SerializationFailure:
            # SQLSTATE 40001: roll back and retry with exponential backoff
            conn.rollback()
            sleep_time = (2 ** attempt) * 0.01  # exponential backoff
            time.sleep(sleep_time)
    raise Exception("Transaction failed after max retries")

def transfer_funds(cur):
    cur.execute("UPDATE accounts SET balance = balance - 100 WHERE id = 1")
    cur.execute("UPDATE accounts SET balance = balance + 100 WHERE id = 2")

conn = psycopg2.connect("postgresql://appuser@localhost:26257/myapp?sslmode=disable")
conn.set_session(autocommit=False)
run_transaction(conn, transfer_funds)

If you see high retry rates, investigate contention:
-- Find transactions with the most contention
SELECT fingerprint_id,
metadata ->> 'query' AS query,
statistics -> 'statistics' -> 'contentionTime' ->> 'mean' AS avg_contention
FROM crdb_internal.transaction_statistics
WHERE (statistics -> 'statistics' -> 'contentionTime' ->> 'mean')::FLOAT > 0
ORDER BY (statistics -> 'statistics' -> 'contentionTime' ->> 'mean')::FLOAT DESC
LIMIT 10;

Reduce contention by keeping transactions small, avoiding long-running transactions, and not funneling concurrent updates through hot rows.
Hot Ranges#
A hot range receives disproportionate traffic, creating a bottleneck on the node holding its leaseholder. The DB Console “Hot Ranges” page shows the top ranges by QPS.
From SQL:
SELECT range_id, start_key, end_key, queries_per_second, lease_holder
FROM crdb_internal.ranges
ORDER BY queries_per_second DESC
LIMIT 10;

Common causes and fixes:
- Sequential primary keys. SERIAL or auto-incrementing IDs cause all inserts to hit the last range. Fix: use UUID primary keys with gen_random_uuid(), or use hash-sharded indexes.
- Single-row hot spot. A counter or status row updated by every request. Fix: shard the counter across multiple rows and sum on read (see the sketch after the example below).
- Hash-sharded index for write-heavy tables:
CREATE TABLE events (
id UUID DEFAULT gen_random_uuid() PRIMARY KEY,
created_at TIMESTAMPTZ DEFAULT now(),
data JSONB,
INDEX idx_events_created (created_at) USING HASH
);
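To illustrate the sharded-counter fix for single-row hot spots, here is a minimal sketch; the table name, the shard count of 8, and the column names are all assumptions:

-- Writes spread across 8 rows instead of one hot row
CREATE TABLE page_view_counts (
    shard INT PRIMARY KEY CHECK (shard BETWEEN 0 AND 7),
    views INT8 NOT NULL DEFAULT 0
);

-- Each increment picks a random shard
INSERT INTO page_view_counts (shard, views)
VALUES (floor(random() * 8)::INT, 1)
ON CONFLICT (shard) DO UPDATE SET views = views + excluded.views;

-- Reads sum across all shards
SELECT sum(views) AS total_views FROM page_view_counts;

The trade-off is a slightly more expensive read (the sum) in exchange for removing the single-range write bottleneck.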
Clock Skew Issues#
CockroachDB requires synchronized clocks across nodes. The default maximum clock offset is 500ms. If a node’s clock drifts beyond this, it self-terminates to protect consistency.
Symptoms: a node crashes with clock synchronization error in logs, or transactions fail with timestamp-related errors.
-- Check clock offset from the DB Console or via SQL
SELECT node_id, clock_offset_ns / 1e6 AS offset_ms
FROM crdb_internal.kv_node_status;

Fix: Run NTP or chrony on every node. On Kubernetes, node clocks come from the host – ensure the host has NTP configured. On cloud VMs, the cloud provider typically handles this, but verify with timedatectl status.
Storage Capacity Problems#
CockroachDB uses Pebble, an LSM-tree storage engine. Running a store nearly out of disk is dangerous: above roughly 95% usage the cluster stops rebalancing new replicas onto the store, and a completely full disk causes the node to shut down.
-- Check per-node storage
SELECT node_id, store_id,
(used::FLOAT / capacity::FLOAT * 100)::INT AS pct_used,
pg_size_pretty(available) AS available
FROM crdb_internal.kv_store_status;

CockroachDB pre-allocates a ballast file (auxiliary/EMERGENCY_BALLAST), sized by default at 1% of store capacity or 1 GiB, whichever is smaller. When the disk is full, delete this file to free space for recovery:
rm cockroach-data/auxiliary/EMERGENCY_BALLAST

Then add storage, decommission the node, or drop unused tables/indexes. After cleanup, the node recreates the ballast on restart. Set alerts on capacity_available and note that space from deletes is reclaimed by compaction over time, not immediately.
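For the alerting suggestion above, a simple starting point is to poll the same store-status table and flag stores that are running low; the 20% threshold here is only an example:

-- Stores with less than 20% of capacity still available
SELECT node_id, store_id,
       (available::FLOAT / capacity::FLOAT * 100)::INT AS pct_available
FROM crdb_internal.kv_store_status
WHERE available::FLOAT / capacity::FLOAT < 0.20;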