Debugging Workflow#
Start broad, then narrow down. Most problems fall into five categories: a service not running, resource exhaustion, a full disk, network failure, or a kernel issue. Work through them in that order: services, resources, disk, network, kernel logs.
Services: systemctl and journalctl#
When a service is misbehaving, start with its status:
systemctl status nginx
This shows whether the service is active, its PID, its last few log lines, and how long it has been running. If the service keeps restarting, the uptime will be suspiciously short.
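To confirm a restart loop, systemd also exposes a restart counter; a quick check, assuming systemd 235 or newer for the NRestarts property:
# Restart count and the time the service last entered the active state
systemctl show nginx -p NRestarts -p ActiveEnterTimestamp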
View full logs for a service:
journalctl -u nginx -b # logs since last boot
journalctl -u nginx -f # follow in real time
journalctl -u nginx -p err # only errors and above
journalctl -u nginx --since "1 hour ago" # time-scoped
If a service fails to start, check the exit code in systemctl status. Common patterns: exit code 1 usually indicates a configuration error, 137 means the process was killed with SIGKILL (often the OOM killer), and 203 means systemd could not execute the binary (wrong path or missing execute permission). Restart with systemctl restart nginx, enable on boot with systemctl enable nginx, and run systemctl daemon-reload after editing a unit file.
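The usual sequence after editing a unit file, using nginx as the running example:
# Reload unit definitions, restart, and confirm the service stayed up
sudo systemctl daemon-reload
sudo systemctl restart nginx
systemctl status nginx --no-pager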
Kernel Messages: dmesg#
When things go wrong at the system level, dmesg shows kernel ring buffer messages. OOM kills, hardware errors, filesystem issues, and driver problems all appear here.
# Recent kernel messages
dmesg --time-format iso | tail -50
# Follow new messages
dmesg -w
# Filter for OOM events
dmesg | grep -i "oom\|out of memory\|killed process"
# Disk/filesystem errors
dmesg | grep -i "error\|fail\|ext4\|xfs"If a process was OOM-killed, dmesg shows which process was chosen. The kernel picks the process with the highest oom_score.
Processes: top, htop, ps#
Identify what is consuming CPU and memory:
# Snapshot of top processes by CPU
ps aux --sort=-%cpu | head -20
# Snapshot by memory
ps aux --sort=-%mem | head -20
# Find a specific process
ps aux | grep '[n]ginx'
# Process tree (parent-child relationships)
ps auxf
htop provides an interactive view with per-core CPU graphs and sortable columns. Use top -b -n 1 for non-interactive output suitable for scripts. For deeper per-process inspection, look at /proc/PID/status for memory details and /proc/PID/fd for open file descriptors.
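For example, reading memory fields and open descriptors for a process straight from /proc (PID is a placeholder):
# Resident and peak memory for the process
grep -E 'VmRSS|VmHWM' /proc/PID/status
# What the process currently has open (each entry is a symlink to a file, socket, or pipe)
ls -l /proc/PID/fd | head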
Disk: df and du#
A full filesystem causes cascading failures – services cannot write logs, databases cannot write data, package managers refuse to install updates.
df -h # filesystem usage
df -i # inode usage (can fill even with free space)
du -sh /var/* | sort -rh | head -10 # largest directories
find / -type f -size +100M 2>/dev/null # large files
Common culprits: unrotated logs in /var/log, old Docker images (docker system prune), package manager cache (apt clean), and core dumps.
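As a sketch of the cleanup pass, assuming a systemd host that also runs Docker and apt (the 500M journal cap is an arbitrary example value):
# How much space the systemd journal is using, and cap it
journalctl --disk-usage
sudo journalctl --vacuum-size=500M
# Reclaim Docker image/layer space and the apt package cache (only if the host runs them)
docker system prune
sudo apt clean
# Core dumps collected by systemd-coredump, if that is how the host is configured
ls -lh /var/lib/systemd/coredump/ 2>/dev/null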
Memory: free and vmstat#
free -h # memory overview
vmstat 2 # continuous monitoring (every 2 seconds)
In free output, the “available” column matters, not “free.” Linux uses unused memory for disk cache, which is reclaimed on demand. A system showing 50MB “free” but 4GB “available” is healthy. In vmstat, watch si/so (swap in/out – constant activity means memory starvation) and wa (I/O wait).
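As a sketch, the key numbers can be pulled out directly; the awk column positions assume the default output format of recent procps free and vmstat:
# Available memory in MiB (7th column of the Mem: row)
free -m | awk '/^Mem:/ {print $7 " MiB available"}'
# Five samples, two seconds apart: swap-in, swap-out, and I/O wait
vmstat 2 5 | awk 'NR>2 {print "si=" $7, "so=" $8, "wa=" $16}'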
Networking: ss and netstat#
ss is the modern replacement for netstat.
ss -tlnp # listening TCP ports with process names
ss -tnp # established connections
ss -tnp dst :443 # connections to a specific port
ss -tn state time-wait | wc -l # TIME_WAIT count
ss -s # socket statistics summary
The -p flag shows the owning process (root is needed to see sockets owned by other users). For connectivity testing: dig example.com +short for DNS, nc -zv -w 5 host 443 for port reachability, traceroute -n host for routing.
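Put together, a quick connectivity triage might look like this (example.com and port 443 stand in for the real endpoint):
# 1. Does the name resolve?
dig example.com +short
# 2. Is the port reachable? (-z: scan only, -w 5: five-second timeout)
nc -zv -w 5 example.com 443
# 3. Where does the path stall?
traceroute -n example.com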
System Calls: strace#
When logs tell you nothing, strace shows exactly what system calls a process is making – file access errors, network connection attempts, and permission denials that never appear in application logs.
# Trace a running process
strace -p PID -f
# Trace a command from start
strace -f -e trace=network,file curl https://example.com
# Only file operations
strace -e trace=open,openat,read,write -p PID
# Trace with timestamps
strace -t -p PID
# Summary of syscalls (count and time spent)
strace -c -p PID
The -f flag follows child processes (critical for forking servers). The -e trace= flag filters to specific syscall categories to reduce noise.
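One common pattern is surfacing file errors an application never logs; a sketch that greps the strace output (which goes to stderr) for the errno names of interest:
# Show only file-related syscalls that failed, e.g. missing configs or permission denials
strace -f -e trace=file curl -sS -o /dev/null https://example.com 2>&1 | grep -E 'ENOENT|EACCES'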
Open Files: lsof#
lsof connects processes to the files and sockets they hold open.
# Files opened by a process
lsof -p PID
# What process has a file open
lsof /var/log/syslog
# All network connections for a process
lsof -i -a -p PID
# What is using a specific port
lsof -i :8080
# Files opened by a user
lsof -u www-data
# Count open files per process (find file descriptor leaks)
lsof | awk '{print $2}' | sort | uniq -c | sort -rn | head -10
A process that continuously opens files without closing them hits the file descriptor limit (ulimit -n), causing “too many open files” errors. The count command above identifies the offending process.
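Once a suspect PID is found, compare its current descriptor count against its limit; a sketch using prlimit from util-linux (substitute the real PID):
# Soft and hard open-file limits for the process
prlimit --pid PID --nofile
# Compare against how many descriptors it currently holds
ls /proc/PID/fd | wc -l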