## sysctl: Kernel Parameter Tuning
The sysctl interface exposes kernel parameters that control how Linux manages memory, networking, file systems, and processes. Changes take effect immediately but are lost on reboot unless persisted.
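Each parameter is also exposed as a file under /proc/sys, with the dots in the name replaced by slashes; the sysctl command is essentially a wrapper around that tree. For example:

```bash
# Reading a parameter -- these two are equivalent
sysctl vm.swappiness            # vm.swappiness = 60
cat /proc/sys/vm/swappiness     # 60

# Writing is also equivalent (both require root)
sysctl -w vm.swappiness=10
echo 10 > /proc/sys/vm/swappiness
```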
### Memory Parameters
```bash
# Reduce swap aggressiveness (default is 60, range 0-100)
# Lower values make the kernel prefer reclaiming page cache over swapping
# Set to 10 for database servers -- swapping destroys database performance
sysctl -w vm.swappiness=10

# Overcommit behavior
# 0 = heuristic overcommit (default, kernel estimates if there is enough memory)
# 1 = always overcommit (never refuse malloc -- dangerous, but recommended by Redis)
# 2 = strict overcommit (never allocate more than swap + overcommit_ratio * physical)
sysctl -w vm.overcommit_memory=0
```

The vm.swappiness parameter is one of the most impactful settings for database servers. The default of 60 means the kernel will fairly aggressively swap application memory to disk in favor of filesystem cache. For databases that manage their own caching (PostgreSQL shared_buffers, MySQL innodb_buffer_pool), this is counterproductive: the database's carefully managed cache gets swapped out to make room for OS-level cache the database does not use.
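Before and after changing swappiness, it is worth confirming whether the box is actually swapping. A quick sketch, assuming the standard procps tools are available:

```bash
# Non-zero si (swap-in) / so (swap-out) columns mean active swapping
vmstat 1 5

# How much swap is in use overall
grep -E 'SwapTotal|SwapFree' /proc/meminfo

# Rough per-process swap usage (run as root to see every process)
grep VmSwap /proc/*/status 2>/dev/null | sort -t: -k3 -rn | head
```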
### Network Parameters
```bash
# TCP listen backlog -- how many connections can queue while waiting for accept()
# The default (4096 since kernel 5.4, 128 before) is too low for
# high-connection servers (web servers, load balancers)
sysctl -w net.core.somaxconn=65535

# SYN queue size -- pending connections in the SYN_RECV state
sysctl -w net.ipv4.tcp_max_syn_backlog=65535

# Reuse sockets in TIME_WAIT for new outbound connections when safe
# Helpful for connection-heavy apps, reverse proxies, microservices
sysctl -w net.ipv4.tcp_tw_reuse=1

# Socket buffer size ceilings (bytes) -- increase for high-throughput workloads
sysctl -w net.core.rmem_max=16777216
sysctl -w net.core.wmem_max=16777216

# TCP buffer auto-tuning range: min, default, max (bytes)
sysctl -w net.ipv4.tcp_rmem="4096 87380 16777216"
sysctl -w net.ipv4.tcp_wmem="4096 65536 16777216"
```

The net.core.somaxconn setting is a frequent cause of connection failures under load. When a server's accept queue is full, new connections are dropped silently: applications see connection timeouts or resets, and the root cause is not obvious without knowing to check this parameter. Note that somaxconn is a ceiling -- whatever backlog an application passes to listen() is silently capped at this value, so raising the backlog in nginx's listen directive, for example, has no effect unless somaxconn is raised to match.
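Rather than guessing, you can observe accept-queue pressure directly. A sketch using ss and the kernel's SNMP counters (ss and nstat ship with iproute2; netstat with net-tools):

```bash
# For LISTEN sockets, Recv-Q is the current accept queue depth and
# Send-Q is the configured backlog
ss -ltn

# Counters that climb when the accept queue overflows
netstat -s | grep -i listen
nstat -az TcpExtListenOverflows TcpExtListenDrops
```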
### File System Parameters
```bash
# System-wide maximum file descriptors
sysctl -w fs.file-max=2097152

# inotify watches -- critical for file watchers, Kubernetes, IDE tools
# Default 8192 is too low for Kubernetes nodes or development machines
sysctl -w fs.inotify.max_user_watches=524288
sysctl -w fs.inotify.max_user_instances=1024
```

Running out of inotify watches produces cryptic errors. Kubernetes uses inotify extensively for watching ConfigMaps and Secrets mounted as volumes. Development tools like webpack, VSCode, and IntelliJ also consume watches. On a Kubernetes node running dozens of pods, the default limit is quickly exhausted.
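When watches run out, it helps to see who is consuming them. A sketch that walks /proc (run as root to see all processes):

```bash
# inotify instances per process (each open inotify fd is one instance)
find /proc/[0-9]*/fd -lname anon_inode:inotify 2>/dev/null \
    | cut -d/ -f3 | sort | uniq -c | sort -rn | head

# Total watches in use: every watch is an "inotify wd:..." line in fdinfo
grep -c '^inotify' /proc/[0-9]*/fdinfo/* 2>/dev/null \
    | awk -F: '$2 > 0 { total += $2 } END { print total, "watches in use" }'
```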
### Persisting Changes
```bash
# Apply settings from /etc/sysctl.conf (-p alone reads only that file)
sysctl -p

# Create persistent configuration
cat > /etc/sysctl.d/99-custom.conf << 'EOF'
vm.swappiness = 10
net.core.somaxconn = 65535
net.ipv4.tcp_max_syn_backlog = 65535
net.ipv4.tcp_tw_reuse = 1
fs.file-max = 2097152
fs.inotify.max_user_watches = 524288
EOF

# Reload every sysctl config file (/etc/sysctl.d/*, /run/sysctl.d/*, /etc/sysctl.conf)
sysctl --system
```

## ulimits: Per-Process Resource Limits
Linux enforces per-process limits on resources like open files and processes. These limits come in two forms: soft (the current effective limit, which a process can raise on its own, but only up to the hard limit) and hard (the ceiling, which only root can raise).
```bash
# Check current limits
ulimit -a               # all limits for the current shell
ulimit -n               # open file descriptor limit
cat /proc/<pid>/limits  # limits for a running process
```

### Configuring Limits
For traditional init systems, edit /etc/security/limits.conf:
```
# /etc/security/limits.conf
*  soft  nofile  65535
*  hard  nofile  65535
*  soft  nproc   65535
*  hard  nproc   65535
```

For systemd services, set limits in the unit file:
```ini
[Service]
LimitNOFILE=65535
LimitNPROC=65535
# allow core dumps
LimitCORE=infinity
```

The systemd approach is preferable because it is explicit per-service and does not depend on PAM. This matters because limits.conf is applied by the pam_limits module during login sessions; systemd services never pass through PAM, so values set in limits.conf are silently ignored for them and the default limits apply instead.
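In practice you usually want a drop-in override rather than editing the packaged unit, so the change survives upgrades. A sketch, with myapp as a placeholder service name:

```bash
# Write a drop-in override for the service
mkdir -p /etc/systemd/system/myapp.service.d
cat > /etc/systemd/system/myapp.service.d/limits.conf << 'EOF'
[Service]
LimitNOFILE=65535
EOF
systemctl daemon-reload && systemctl restart myapp

# Verify the limit actually reached the running process
systemctl show -p LimitNOFILE myapp
grep 'open files' /proc/"$(systemctl show -p MainPID --value myapp)"/limits
```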
## I/O Schedulers
The I/O scheduler determines how the kernel orders and merges disk I/O requests. The right scheduler depends on your storage hardware.
```bash
# Check current scheduler (the active one is in brackets)
cat /sys/block/sda/queue/scheduler
# Output example: [mq-deadline] kyber bfq none
```

Available schedulers:
- none (also called noop): No reordering. Best for NVMe and SSDs, which have their own internal scheduling. Adding kernel-level scheduling on top of the device’s scheduling just adds latency.
- mq-deadline: Ensures every request gets serviced within a deadline. Good balance for mixed workloads on SSDs or fast HDDs. Prevents starvation of reads by writes.
- bfq (Budget Fair Queueing): Allocates I/O bandwidth fairly among processes. Best for interactive/desktop use where you want responsive I/O even when background tasks are running.
```bash
# Set scheduler for a device (non-persistent)
echo none > /sys/block/nvme0n1/queue/scheduler

# Persistent via udev rule
cat > /etc/udev/rules.d/60-scheduler.rules << 'EOF'
ACTION=="add|change", KERNEL=="nvme*", ATTR{queue/scheduler}="none"
ACTION=="add|change", KERNEL=="sd*", ATTR{queue/scheduler}="mq-deadline"
EOF
```

Rule of thumb: use none for NVMe, mq-deadline for SATA SSDs and HDDs, and bfq only for desktop/interactive workloads.
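To audit what every block device is currently using, and whether a device is rotational, something like the following works on any modern kernel:

```bash
# Active scheduler per device (the bracketed entry is active)
grep . /sys/block/*/queue/scheduler

# ROTA=1 means spinning disk, ROTA=0 means SSD/NVMe
lsblk -d -o NAME,ROTA,TYPE,MODEL
```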
## Transparent Huge Pages (THP)
Linux can automatically use 2MB “huge pages” instead of the standard 4KB pages, reducing TLB (Translation Lookaside Buffer) misses. This sounds beneficial, but it causes serious problems for databases.
The issue is that THP triggers background compaction: the kernel rearranges memory to create contiguous 2MB blocks, which causes unpredictable latency spikes. Redis, MongoDB, PostgreSQL, and many other databases explicitly recommend disabling THP.
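Before disabling THP, you can check whether it is active and whether compaction is actually stalling processes; the relevant counters live in /proc:

```bash
# Anonymous memory currently backed by transparent huge pages
grep AnonHugePages /proc/meminfo

# compact_stall counts the times a process blocked waiting for compaction --
# a steadily climbing value is the latency-spike signature
grep -E 'compact_stall|thp_fault_alloc|thp_collapse_alloc' /proc/vmstat
```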
```bash
# Check current state
cat /sys/kernel/mm/transparent_hugepage/enabled
# Output: [always] madvise never

# Disable (non-persistent)
echo never > /sys/kernel/mm/transparent_hugepage/enabled
echo never > /sys/kernel/mm/transparent_hugepage/defrag

# Persistent via systemd service
cat > /etc/systemd/system/disable-thp.service << 'EOF'
[Unit]
Description=Disable Transparent Huge Pages
DefaultDependencies=no
After=sysinit.target local-fs.target
Before=basic.target

[Service]
Type=oneshot
ExecStart=/bin/sh -c 'echo never > /sys/kernel/mm/transparent_hugepage/enabled && echo never > /sys/kernel/mm/transparent_hugepage/defrag'

[Install]
WantedBy=basic.target
EOF
systemctl daemon-reload && systemctl enable --now disable-thp
```

## CPU Governors
CPU frequency governors control how the CPU scales its clock speed:
```bash
# Check current governor
cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor

# Set governor (all CPUs)
echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
```

- performance: Always run at maximum frequency. Use for latency-sensitive workloads (databases, real-time processing). Higher power consumption.
- powersave: Always run at minimum frequency. Use only when power conservation is critical.
- ondemand: Scale frequency based on load (legacy). Reacts to load after it happens.
- schedutil: Scale frequency based on scheduler utilization data (modern, recommended default). More responsive than ondemand because it uses scheduler information directly.

Note that on Intel machines using the intel_pstate driver in active mode, only performance and powersave are exposed, and powersave there scales frequency dynamically rather than pinning the minimum.
For production servers running latency-sensitive workloads, performance eliminates the latency of CPU frequency scaling. The power cost is marginal on servers that are already running hot.
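If the cpupower utility (shipped in most distributions' linux-tools package) is installed, it is a friendlier interface than writing sysfs files by hand:

```bash
# Show the active driver, available governors, and frequency limits
cpupower frequency-info

# Set the performance governor on all CPUs
cpupower frequency-set -g performance
```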
## Dirty Page Management
When applications write to files, the data goes to memory (page cache) first and is written to disk later by background threads. The “dirty” parameters control when this flushing happens:
```bash
# Percentage of total memory that can be dirty before background flushing starts
sysctl -w vm.dirty_background_ratio=5  # default 10

# Percentage of total memory that can be dirty before writing processes
# are forced to block and flush synchronously
sysctl -w vm.dirty_ratio=10            # default 20
```

Lower values mean more frequent, smaller flushes; higher values mean less frequent, larger flushes. For write-heavy workloads (logging, data ingestion), lower values prevent the stall where the system suddenly blocks all writers to flush a massive backlog. For systems with battery-backed write caches, higher values can improve throughput.
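On large-memory machines, percentages are too coarse: 1% of 512GB is already roughly 5GB of dirty data. The kernel also accepts absolute byte values, which override the ratio settings when set (the thresholds below are illustrative, not recommendations):

```bash
# Byte-based equivalents -- setting these zeroes the corresponding *_ratio
sysctl -w vm.dirty_background_bytes=268435456   # background flush at 256MB
sysctl -w vm.dirty_bytes=1073741824             # block writers at 1GB
```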
## Network Tuning for High Throughput
### TCP Congestion Control
```bash
# Check available algorithms
sysctl net.ipv4.tcp_available_congestion_control

# BBR (Bottleneck Bandwidth and RTT) -- Google's congestion control
# Better throughput and lower latency than cubic on most networks
sysctl -w net.ipv4.tcp_congestion_control=bbr
sysctl -w net.core.default_qdisc=fq  # fq supplies the pacing BBR relies on
                                     # (strictly required only on older kernels)
```

BBR typically provides better throughput than the default cubic algorithm, especially on high-bandwidth, high-latency links (cloud environments, cross-datacenter traffic). It measures the actual bottleneck bandwidth and RTT rather than relying on packet loss as a congestion signal.
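On most distributions BBR is built as a module, so it may need loading before the sysctl takes effect. A quick verification sequence:

```bash
# Load the module if bbr is missing from the available list
modprobe tcp_bbr
sysctl net.ipv4.tcp_available_congestion_control

# Confirm it is active globally...
sysctl net.ipv4.tcp_congestion_control
# ...and on live connections (ss -i prints the algorithm per socket)
ss -tin | head
```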
### TCP Window Scaling
TCP window scaling is enabled by default on modern Linux, but verify it:

```bash
sysctl net.ipv4.tcp_window_scaling  # should be 1
```

## Kubernetes Node Tuning
Kubernetes nodes need specific tuning beyond default Linux settings:
```
# /etc/sysctl.d/99-kubernetes-node.conf

# File descriptors -- pods consume many
fs.file-max = 2097152

# inotify -- Kubernetes watches ConfigMaps, Secrets as volumes
fs.inotify.max_user_watches = 1048576
fs.inotify.max_user_instances = 4096

# Connection tracking -- conntrack table for iptables/nftables rules
# Default 65536 is too low for nodes with many Services
net.netfilter.nf_conntrack_max = 1048576

# Network -- high pod-to-pod and pod-to-service traffic
net.core.somaxconn = 65535
net.ipv4.tcp_max_syn_backlog = 65535
net.ipv4.ip_local_port_range = 1024 65535

# Reuse TIME_WAIT sockets for outbound connections
net.ipv4.tcp_tw_reuse = 1

# ARP table size -- large clusters have many nodes/pods
net.ipv4.neigh.default.gc_thresh1 = 4096
net.ipv4.neigh.default.gc_thresh2 = 8192
net.ipv4.neigh.default.gc_thresh3 = 16384
```

The nf_conntrack_max setting deserves special attention. Every connection through a Kubernetes Service creates a conntrack entry. On busy nodes with many Services, the default conntrack table fills up and new connections are silently dropped. The symptom is intermittent connection failures that are difficult to diagnose without checking dmesg for "nf_conntrack: table full, dropping packet" messages.
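Conntrack pressure can be watched before it becomes an outage by comparing the live entry count against the limit:

```bash
# Current entries vs. configured maximum
cat /proc/sys/net/netfilter/nf_conntrack_count
cat /proc/sys/net/netfilter/nf_conntrack_max

# The kernel logs this when the table overflows and drops connections
dmesg | grep -i 'table full'
```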
In OpenShift environments, the Node Tuning Operator (part of the Tuned project) can apply sysctl settings declaratively using custom resources, ensuring consistent configuration across all nodes without manual SSH access.