Troubleshooting


1. CPU High

In short: find the process first, then the thread, and finally check whether the time goes to computation, IO wait, or system calls.

# Show live CPU, memory, load, and top processes.
top

# List the highest CPU processes.
ps aux --sort=-%cpu | head -20

# Show per-process CPU usage every second.
pidstat -u -p <pid> 1

# Show CPU usage by thread for one process.
ps -T -p <pid> -o pid,tid,pcpu,pmem,comm

# Attach to a process and see frequent syscalls or blocking calls.
strace -tt -p <pid>

How to read it:

Signal            Meaning
us high           application code is busy
sy high           kernel / syscall heavy
wa high           CPU is idle while waiting for IO, usually disk
one thread high   single hot thread / loop
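
The us / sy / wa split in the table can also be computed from two samples of the first line of /proc/stat. This is a minimal sketch, assuming the Linux field order (user, nice, system, idle, iowait, irq, softirq, steal). The function name cpu_split is hypothetical, and in this sketch us counts user plus nice while sy counts system plus irq plus softirq, which is coarser than what top prints.

```shell
# cpu_split EARLIER LATER
# Print the us/sy/wa share (integer percent) between two "cpu ..." lines
# taken from the first line of /proc/stat.
cpu_split() {
  printf '%s\n%s\n' "$1" "$2" | awk '
    NR == 1 { for (i = 2; i <= 9; i++) a[i] = $i }
    NR == 2 {
      total = 0
      for (i = 2; i <= 9; i++) { d[i] = $i - a[i]; total += d[i] }
      if (total == 0) { print "us=0 sy=0 wa=0"; exit }
      # fields: 2=user 3=nice 4=system 5=idle 6=iowait 7=irq 8=softirq 9=steal
      us = 100 * (d[2] + d[3]) / total
      sy = 100 * (d[4] + d[7] + d[8]) / total
      wa = 100 * d[6] / total
      printf "us=%d sy=%d wa=%d\n", us, sy, wa
    }'
}

# Usage (one-second window):
#   a=$(head -n 1 /proc/stat); sleep 1; b=$(head -n 1 /proc/stat)
#   cpu_split "$a" "$b"
```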

2. Memory High

In short: first confirm whether available is low; do not look only at used.

# Show memory usage, available memory, cache, and swap.
free -h

# Show memory pressure, swap activity, and run queue every second.
vmstat 1

# List the highest memory processes.
ps aux --sort=-%mem | head -20

# Check whether the kernel or cgroup killed a process because of OOM.
dmesg -T | grep -i -E 'oom|killed process'

# Show enabled swap devices and current swap usage.
swapon --show

How to read it:

Signal              Meaning
available low       real memory pressure
buff/cache high     usually page cache, not necessarily a problem
swap in/out high    memory pressure is already hurting latency
OOM log exists      process was killed by the kernel or a cgroup limit
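
To put a number on "available low", the available share can be read from /proc/meminfo. This is a minimal sketch, assuming a kernel that exposes the MemAvailable field; the function name mem_available_pct is hypothetical, and what counts as "low" is left to the reader.

```shell
# mem_available_pct
# Read /proc/meminfo-style text on stdin and print MemAvailable as an
# integer percentage of MemTotal. Exits non-zero if MemTotal is missing.
mem_available_pct() {
  awk '
    $1 == "MemTotal:"     { total = $2 }
    $1 == "MemAvailable:" { avail = $2 }
    END {
      if (total > 0) { printf "%d\n", 100 * avail / total } else { exit 1 }
    }'
}

# Usage:
#   mem_available_pct < /proc/meminfo
```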

3. Disk Full

In short: check filesystem space, inodes, and deleted files together.

# Show filesystem space usage.
df -h

# Show inode usage for filesystems.
df -ih

# Find large first-level directories under /var.
du -h --max-depth=1 /var | sort -h

# Find files larger than 1 GiB under /var.
find /var -type f -size +1G -ls

# Find deleted files that are still held open by processes.
lsof +L1

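How to read it:

Signal                         Meaning
df -h shows 100%               filesystem is out of space
df -ih shows 100%              inodes exhausted, usually too many small files
lsof +L1 lists large files     deleted files still held open; space is not freed until the process closes them
/var/log is huge               log rotation / retention issue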
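
The first two rows of the table can be checked with the same filter, since df -P and df -Pi put the usage percentage in the fifth column. This is a minimal sketch; the function name df_over and the threshold are hypothetical, and it assumes mount points without spaces.

```shell
# df_over LIMIT
# Read `df -P` (or `df -Pi`) output on stdin and print every mount point
# whose usage percentage is at or above LIMIT.
df_over() {
  awk -v limit="$1" '
    NR > 1 {
      use = $5
      sub(/%$/, "", use)          # "95%" -> "95"; "-" stays and counts as 0
      if (use + 0 >= limit + 0) print $6, use "%"
    }'
}

# Usage:
#   df -P  | df_over 90    # space
#   df -Pi | df_over 90    # inodes
```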

4. Disk IO High

In short: first see whether the disk is busy, then locate which process is reading or writing.

# Show disk utilization, queueing, and latency every second.
iostat -xz 1

# Show only processes that are doing disk IO, with totals accumulated since iotop started.
iotop -oPa

# Show per-process disk read/write activity every second.
pidstat -d 1

# Show block devices, filesystems, and mount targets.
lsblk -f

How to read it:

Signal                    Meaning
%util near 100            device likely saturated (less reliable for SSD/NVMe and RAID, which serve requests in parallel)
await high                IO latency is high
one process high write    log / batch / database write pressure
many random reads         cache miss or query pattern issue
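
Because the column layout of iostat -x differs between sysstat versions, a filter for the "%util near 100" row is safer if it looks the column up by name. This is a minimal sketch; the function name busy_devices and the threshold are hypothetical, and it assumes a locale with a decimal point (for example LC_ALL=C).

```shell
# busy_devices LIMIT
# Read `iostat -x` output on stdin and print each device whose %util is at
# or above LIMIT. The %util column is located from the "Device" header line.
busy_devices() {
  awk -v limit="$1" '
    NF == 0 { col = 0; next }     # a blank line ends the device table
    $1 ~ /^Device/ {
      col = 0
      for (i = 1; i <= NF; i++) if ($i == "%util") col = i
      next
    }
    col && NF >= col && $col + 0 >= limit + 0 { print $1, $col }'
}

# Usage:
#   LC_ALL=C iostat -xz 1 3 | busy_devices 90
```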

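5. Port Not Reachable

In short: check the listener on the local host, the connection from the remote side, and routing, firewalls, and security groups in between.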

# Show listening TCP ports and owning processes.
ss -lntp

# Find which process is using a specific port.
lsof -i :<port>

# Test whether the service works from the same host.
curl -v http://127.0.0.1:<port>/

# Test whether the service works through the target host or IP.
curl -v http://<host>:<port>/

# Show routing table and default gateway.
ip route

# Capture packets for the target port to see whether traffic reaches the host.
tcpdump -i <iface> port <port>

How to read it:

Signal                            Meaning
no listener                       service did not bind the port
listener on 127.0.0.1 only        remote hosts cannot reach it
SYN sent, no reply                firewall / route / security group issue
connection reset                  app or proxy actively rejected it

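
The first two rows of the table can be told apart from the ss -lnt output alone. This is a minimal sketch, assuming the usual column order where the local address is the fourth field; the function name listen_scope and its three result strings are hypothetical.

```shell
# listen_scope PORT
# Read `ss -lnt` output on stdin and report whether PORT has no listener,
# a loopback-only listener, or a listener on a non-loopback address.
listen_scope() {
  awk -v port="$1" '
    $1 == "LISTEN" {
      addr = $4
      n = split(addr, parts, ":")
      if (parts[n] != port) next
      # Everything before the final ":port" is the bind address.
      host = substr(addr, 1, length(addr) - length(parts[n]) - 1)
      found = 1
      if (host !~ /^127\./ && host != "[::1]") external = 1
    }
    END {
      if (!found) print "no listener"
      else if (external) print "non-loopback listener"
      else print "loopback only"
    }'
}

# Usage:
#   ss -lnt | listen_scope 8080
```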
6. Service Failed

In short: for a failed systemd service, check status first, then the journal, then the unit file and environment variables.

# Show service state, exit code, recent logs, and restart status.
systemctl status <service>

# Show recent journal logs for the service.
journalctl -u <service> -n 200

# Follow service logs in real time.
journalctl -u <service> -f

# Show the effective systemd unit and override files.
systemctl cat <service>

# Show variables set inline with Environment= and the files referenced by EnvironmentFile=.
systemctl show <service> --property=Environment,EnvironmentFiles

# Clear systemd failed state after fixing the service.
systemctl reset-failed <service>

How to read it:

Signal                    Meaning
exit code non-zero        app exited with an error, often during startup
permission denied         user / file / capability issue
address already in use    port conflict
restart loop              dependency, config, or health check issue
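
Two of the signals in the table are plain log strings, so a first pass over the journal can be automated. This is a minimal sketch that only looks for "permission denied" and "address already in use"; the function name classify_failure and the hint texts are hypothetical, and real services may word these errors differently.

```shell
# classify_failure
# Read service log text on stdin and print a hint for each known pattern.
classify_failure() {
  awk '
    { line = tolower($0) }
    line ~ /permission denied/      { perm = 1 }
    line ~ /address already in use/ { port = 1 }
    END {
      if (perm) print "permission denied: check user, file modes, capabilities"
      if (port) print "address already in use: check for a port conflict"
      if (!perm && !port) print "no known pattern"
    }'
}

# Usage:
#   journalctl -u <service> -n 200 --no-pager | classify_failure
```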

7. DNS / TLS Failed

In short: for DNS, check the resolution path first; for TLS, check the certificate, SNI, and system time first.

# Check whether the current machine's default DNS resolver can resolve the domain.
dig <domain>

# Bypass the local resolver and compare the answer with public DNS.
dig @8.8.8.8 <domain>

# Check the full HTTP/TLS path: DNS, TCP connect, TLS handshake, cert validation, and HTTP status.
curl -v https://<domain>/

# Check TLS handshake, certificate chain, SAN/CN, expiry, and SNI behavior.
openssl s_client -connect <host>:443 -servername <domain>

# Check whether system time is obviously wrong.
date

# Check timezone, NTP status, and whether system time is synchronized.
timedatectl

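How time relates to TLS:

Check                    Why it matters
system time too early    The certificate may be judged not yet valid.
system time too late     The certificate may be judged expired.
NTP disabled             Time drifts slowly and intermittently breaks TLS, JWT, signed requests, and log timelines.
timezone wrong           Usually does not affect TLS validation, but it does affect log investigation and reading alert times.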
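
The first two rows come down to comparing the local clock with the certificate's validity window. This is a minimal sketch that takes all three values as epoch seconds so the comparison itself has no date-parsing dependency; the function name cert_time_check is hypothetical, and the conversion shown in the usage comment assumes GNU date.

```shell
# cert_time_check NOW NOT_BEFORE NOT_AFTER
# All arguments are epoch seconds. Print how NOW relates to the
# certificate validity window.
cert_time_check() {
  if [ "$1" -lt "$2" ]; then
    echo "not yet valid"
  elif [ "$1" -gt "$3" ]; then
    echo "expired"
  else
    echo "within validity window"
  fi
}

# Usage (GNU date assumed for -d):
#   dates=$(echo | openssl s_client -connect <host>:443 -servername <domain> 2>/dev/null \
#           | openssl x509 -noout -dates)
#   nb=$(date -d "$(printf '%s\n' "$dates" | sed -n 's/^notBefore=//p')" +%s)
#   na=$(date -d "$(printf '%s\n' "$dates" | sed -n 's/^notAfter=//p')" +%s)
#   cert_time_check "$(date +%s)" "$nb" "$na"
```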

How to read it:

Signal                  Meaning
only local DNS fails    resolver or /etc/resolv.conf issue
public DNS also fails   domain / record issue
cert name mismatch      wrong certificate or missing SNI
cert expired            certificate renewal issue
system time wrong       TLS validation may fail
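
The two DNS rows of the table can be expressed as a comparison of the two dig answers. This is a minimal sketch that only checks whether each answer is empty; the function name dns_verdict is hypothetical, and it does not try to interpret answers that differ (CDNs and split-horizon setups legitimately return different records).

```shell
# dns_verdict LOCAL_ANSWER PUBLIC_ANSWER
# Each argument is the output of `dig +short`, from the default resolver
# and from a public resolver respectively.
dns_verdict() {
  if [ -z "$2" ]; then
    echo "public DNS fails: domain / record issue"
  elif [ -z "$1" ]; then
    echo "only local DNS fails: resolver or /etc/resolv.conf issue"
  else
    echo "both resolvers answered"
  fi
}

# Usage:
#   dns_verdict "$(dig +short <domain>)" "$(dig +short @8.8.8.8 <domain>)"
```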