The USE Method: A Framework for Systematic Diagnosis#

The USE method, developed by Brendan Gregg, provides a structured approach to system performance analysis. For every resource on the system – CPU, memory, disk, network – you check three things:

  • Utilization: How busy is the resource? (e.g., CPU at 90%)
  • Saturation: Is work queuing because the resource is overloaded? (e.g., CPU run queue length)
  • Errors: Are there error events? (e.g., disk I/O errors, network packet drops)

This method prevents the common trap of randomly checking things. Instead, you systematically walk through each resource and check all three dimensions. If you find high utilization, saturation, or errors on a resource, you have found your bottleneck.

The recommended investigation order is: CPU, Memory, Disk, Network, Processes, Logs. This order works because CPU and memory issues are the most common, and each step builds context for the next.
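
Before drilling into any one resource, it helps to take a single quick pass over all of them. A minimal sketch of such a sweep, assuming the usual procps, sysstat, and iproute2 tools are installed:

uptime                        # load averages: quick CPU saturation hint
vmstat 1 3                    # memory, swap activity, run queue
iostat -x 1 3                 # disk utilization and wait times
ss -s                         # network connection summary
dmesg -T | tail -20           # recent kernel messages (errors, OOM kills)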

CPU Investigation#

Start with the big picture using top or htop:

top -bn1 | head -20          # snapshot view, non-interactive
htop                          # interactive, color-coded, tree view

Key things to look at in top: the load average (1, 5, and 15 minute averages), overall CPU percentages (us = user, sy = system, wa = I/O wait, id = idle), and per-process CPU usage. On Linux, the load average is the average number of processes that are runnable (running or waiting for CPU) plus those in uninterruptible sleep, which usually means waiting on I/O. On a 4-core system, a load average of 4.0 means the CPUs are fully utilized; sustained values above that mean processes are queuing.

A critical pattern to recognize: high load average but low CPU usage. This means processes are waiting but not for CPU – they are in I/O wait or uninterruptible sleep. Check the wa (I/O wait) value in top. If wa is high, the bottleneck is disk I/O, not CPU.
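
One way to confirm this pattern is to look for processes in uninterruptible sleep (state D), which inflate the load average without using any CPU; a quick check:

ps -eo state,pid,cmd | awk '$1 == "D"'    # processes stuck in uninterruptible (usually I/O) sleep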

For per-CPU breakdown:

mpstat -P ALL 1 5             # per-CPU stats, 1-second interval, 5 samples

This reveals if one CPU is pegged at 100% while others are idle – a sign of a single-threaded bottleneck. It also shows if the system is spending excessive time in system calls (%sys) vs user code (%usr).

To identify which process is consuming CPU:

pidstat 1 5                   # per-process CPU, 1-second interval
pidstat -t -p <pid> 1         # per-thread breakdown for a specific process

Memory Investigation#

The most important command for memory is free:

free -h

This produces output like:

              total        used        free      shared  buff/cache   available
Mem:           31Gi        12Gi       1.2Gi       256Mi        18Gi        18Gi

The critical column is available, not free. Linux uses unused memory for disk caching (buff/cache), which is reclaimed when applications need it. A system showing 1.2Gi “free” but 18Gi “available” is healthy – the kernel is using spare memory productively. Only worry when “available” is low.
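
If you just want the number that matters, MemAvailable can be read directly from the kernel; for example:

awk '/MemAvailable/ {printf "%.1f GiB available\n", $2/1048576}' /proc/meminfo    # /proc/meminfo values are in kB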

Check for swap activity with vmstat:

vmstat 1 10                   # 1-second interval, 10 samples

Watch the si (swap in) and so (swap out) columns. Any non-zero so value means the system is actively pushing memory to disk, which devastates performance. Consistent swap activity is a strong signal that the system needs more RAM or a process has a memory leak.
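
To see which processes are sitting in swap, the VmSwap field in /proc/<pid>/status can be collected per process (present on any reasonably modern kernel); a rough sketch:

for pid in /proc/[0-9]*; do                              # walk every process directory
    awk -v p="${pid##*/}" '/^VmSwap/ {print $2, p}' "$pid/status" 2>/dev/null
done | sort -rn | head                                   # largest swap users first (values in kB)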

For detailed memory breakdown:

cat /proc/meminfo             # full kernel memory statistics
smem -tk                      # per-process memory (USS = unique, PSS = proportional)

smem is particularly useful because it shows actual per-process memory consumption, accounting for shared libraries. The PSS (Proportional Set Size) column divides shared memory proportionally among the processes sharing it, giving a realistic picture.

Disk Investigation#

Disk problems come in two forms: running out of space and I/O performance issues.

For space:

df -h                         # filesystem space usage
df -i                         # inode usage -- critical and often overlooked

Inodes can run out before disk space. A filesystem with millions of tiny files (common with mail servers, container layers, or build caches) can exhaust inodes while gigabytes of space remain. The symptom is “No space left on device” errors despite df -h showing available space. Always check df -i when you see space errors.
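
To find where the inodes went, GNU du can count inodes per directory (coreutils 8.22 or later); for example:

du --inodes -x / 2>/dev/null | sort -rn | head -20       # directories holding the most inodes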

For I/O performance:

iostat -x 1 5                 # extended I/O stats, 1-second interval

Key columns: await (average I/O request wait time in ms – should be under 10ms for SSDs, under 20ms for HDDs), %util (device utilization – 100% means saturated), and avgqu-sz (average queue size – high values mean I/O is queuing). Newer sysstat releases split await into r_await and w_await and rename avgqu-sz to aqu-sz; the interpretation is the same.

To identify which process is causing I/O:

iotop -oP                     # show only processes doing I/O, per-process

Network Investigation#

Start with what is listening and connected:

ss -tlnp                      # TCP listening ports with process names
ss -s                         # connection state summary (established, TIME_WAIT, etc.)
ss -tnp state established     # all established connections

A high number of TIME_WAIT connections can indicate connection churn. Thousands of CLOSE_WAIT connections indicate a process that is not properly closing sockets – typically an application bug.
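
A quick way to tally connections by state and spot these pileups:

ss -tan | awk 'NR > 1 {print $1}' | sort | uniq -c | sort -rn    # connection count per TCP state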

For bandwidth investigation:

iftop -i eth0                 # real-time bandwidth per connection
nethogs eth0                  # bandwidth per process (more useful)

When you need to see actual packet content:

tcpdump -i eth0 port 80 -nn -c 100    # capture 100 packets on port 80
tcpdump -i any host 10.0.0.5 -w /tmp/capture.pcap   # write to file for Wireshark
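
To take a quick look at a saved capture without leaving the terminal:

tcpdump -nn -r /tmp/capture.pcap | head -50           # read back the first packets of the capture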

Process Investigation#

When you have identified a suspect process:

ps aux --sort=-%mem | head -20          # top 20 processes by memory
ps aux --sort=-%cpu | head -20          # top 20 processes by CPU
ps -eo pid,ppid,stat,cmd --forest      # process tree showing parent-child relationships

To see what a process is doing at the system call level:

strace -p <pid> -c                     # syscall summary (count and time per call)
strace -p <pid> -e trace=network       # only network-related syscalls
strace -p <pid> -e trace=file          # only file-related syscalls
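
When the summary points at a syscall class but you need to see individual slow calls, -T appends the time spent in each one (keep in mind that strace adds significant overhead on a busy process):

strace -p <pid> -f -T -e trace=file       # follow threads, show time spent in each file syscall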

To see what files and sockets a process has open:

lsof -p <pid>                          # all open files, sockets, pipes
lsof -i :8080                          # which process is using port 8080
lsof +D /var/log                       # which processes have files open in /var/log

Log Investigation#

Logs are where you confirm what the metrics are telling you:

journalctl -u <service> --since "30 min ago"   # recent service logs
journalctl -u <service> -p err                  # errors only
journalctl -f                                    # follow all system logs
dmesg --time-format iso | tail -100              # recent kernel messages
dmesg -T | grep -i error                         # kernel errors with human timestamps

dmesg is especially important for: OOM kills, disk errors, hardware failures, and filesystem issues. These never appear in application logs.

Check standard log locations when journalctl does not have what you need:

/var/log/syslog          # general system log (Debian/Ubuntu)
/var/log/messages        # general system log (RHEL/CentOS)
/var/log/auth.log        # authentication events
/var/log/kern.log        # kernel messages

Common Patterns and Their Diagnosis#

OOM Killer: The kernel kills processes when memory is exhausted. Detect with:

dmesg | grep -i "oom\|out of memory"
journalctl -k | grep -i oom

The kernel log shows which process was killed and how much memory it was using. The OOM killer selects victims based on an oom_score – processes using more memory get higher scores.
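
To see which processes the kernel would target next, each process exposes its current score under /proc; a quick ranking:

grep -H . /proc/[0-9]*/oom_score 2>/dev/null | sort -t: -k2 -rn | head    # highest score = most likely victim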

Disk Full (Including Inodes): When df -h shows space but operations fail, check inodes with df -i. Also check if a deleted file is still held open by a process:

lsof +L1                # files that have been deleted but are still open

A common scenario: you delete a large log file, but the process still holds it open. The space is not freed until the process closes the file or is restarted.
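
If restarting the process is not an option, the space can usually be reclaimed by truncating the deleted file through its /proc file descriptor; a sketch (find the FD number in the first command's output):

ls -l /proc/<pid>/fd | grep deleted       # find the FD pointing at the deleted file
: > /proc/<pid>/fd/<fd>                   # truncate it in place, freeing the space

This works for regular files; for logs opened in append mode the writer simply continues at the new end of file.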

Zombie Processes: Processes that have exited but whose parent has not yet read their exit status. They show as Z in ps. They consume essentially nothing beyond a process-table entry and a PID, but enough of them can exhaust the PID space. Many zombies indicate a buggy parent process that is not calling wait() on its children.

ps aux | awk '$8 ~ /Z/ {print}'        # find zombie processes
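
Since the fix lives in the parent, it helps to see which parents are leaking them; a quick tally of parent PIDs:

ps -eo pid,ppid,stat,cmd | awk '$3 ~ /^Z/ {print $2}' | sort | uniq -c    # parent PIDs leaking zombies

Then inspect those parent PIDs with ps to see which service needs a fix or a restart.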

File Descriptor Exhaustion: Processes or the system running out of file descriptors:

ulimit -n                              # current per-process limit
cat /proc/sys/fs/file-nr               # system-wide: allocated, unused, max
ls /proc/<pid>/fd | wc -l              # how many FDs a process has open
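
To find the processes holding the most descriptors, a rough sweep over /proc (run as root so every fd directory is readable):

for pid in /proc/[0-9]*; do                              # count open descriptors per process
    printf '%s %s\n' "$(ls "$pid/fd" 2>/dev/null | wc -l)" "${pid##*/}"
done | sort -rn | head                                   # processes holding the most open FDs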

The “It’s Slow” Investigation#

When the symptom is simply “it’s slow,” determine which resource is the bottleneck:

  1. CPU bound: top shows high CPU usage, low wa. The application is compute-limited. Solutions: optimize code, add CPU cores, distribute load.
  2. I/O bound: top shows high wa (I/O wait), iostat shows high await and %util. The application is waiting on disk. Solutions: faster disks (SSD/NVMe), reduce I/O (caching, better queries), spread I/O across disks.
  3. Memory bound: high swap activity (vmstat si/so), low “available” in free -h. The system is thrashing. Solutions: add RAM, reduce memory usage, fix memory leaks.
  4. Network bound: nethogs or iftop shows high bandwidth, or ss shows many connections in unusual states. Solutions: increase bandwidth, optimize payload sizes, add connection pooling.

Run through this checklist in order. The first resource showing saturation or high utilization is usually your primary bottleneck. Fix that first, then re-evaluate – fixing one bottleneck often reveals the next.