1. CPU High
In short: find the process first, then the thread, and finally determine whether the time goes to computation, IO wait, or system calls.
# Show live CPU, memory, load, and top processes.
top
# List the highest CPU processes.
ps aux --sort=-%cpu | head -20
# Show CPU usage for one process every second.
pidstat -u -p <pid> 1
# Show CPU usage by thread for one process.
ps -T -p <pid> -o pid,tid,pcpu,pmem,comm
# Attach to a process and see frequent syscalls or blocking calls.
strace -tt -p <pid>

Diagnosis:

| Signal | Meaning |
|---|---|
| `us` high | application code is busy |
| `sy` high | kernel / syscall heavy |
| `wa` high | waiting for disk IO |
| one thread high | single hot thread / loop |
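
The "process first, then thread" step can be sketched as a small filter over the `ps -T` output used in this section. This is a minimal sketch, not a definitive tool: `hottest_thread` is a hypothetical helper name, and it assumes the column order `pid,tid,pcpu,pmem,comm`.

```shell
# Hypothetical helper: read `ps -T -p <pid> -o pid,tid,pcpu,pmem,comm` output
# on stdin and print "TID %CPU COMMAND" for the thread with the highest %CPU.
# Assumes the column order above; only the first word of COMMAND is kept.
hottest_thread() {
    awk '
        NR > 1 && (tid == "" || $3 + 0 > max) {
            max = $3 + 0; tid = $2; pcpu = $3; comm = $5
        }
        END { if (tid != "") print tid, pcpu, comm }
    '
}
```

Possible usage: `ps -T -p <pid> -o pid,tid,pcpu,pmem,comm | hottest_thread`. Note that `ps` reports %CPU averaged over the thread's lifetime, so a live view such as `top -H -p <pid>` is a useful cross-check.
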
2. Memory High
In short: first check whether `available` is low; do not look only at `used`.
# Show memory usage, available memory, cache, and swap.
free -h
# Show memory pressure, swap activity, and run queue every second.
vmstat 1
# List the highest memory processes.
ps aux --sort=-%mem | head -20
# Check whether the kernel or cgroup killed a process because of OOM.
dmesg -T | grep -i -E 'oom|killed process'
# Show enabled swap devices and current swap usage.
swapon --show

Diagnosis:

| Signal | Meaning |
|---|---|
| `available` low | real memory pressure |
| `buff/cache` high | usually page cache, not always bad |
| swap in/out high | memory pressure already hurts latency |
| OOM log exists | process was killed by kernel or cgroup |
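
To put a number on "available low", one option is to compute available memory as a percentage of total from `/proc/meminfo`. The sketch below is illustrative only: `mem_available_pct` is a hypothetical helper name, and what counts as "low" is left to the reader because it depends on the workload.

```shell
# Hypothetical helper: read /proc/meminfo-style text on stdin and print
# MemAvailable as a whole-number percentage of MemTotal.
# Prints nothing if MemTotal is missing or zero.
mem_available_pct() {
    awk '
        /^MemTotal:/     { total = $2 }
        /^MemAvailable:/ { avail = $2 }
        END { if (total > 0) printf "%d\n", avail * 100 / total }
    '
}
```

Possible usage: `mem_available_pct < /proc/meminfo`. The result is truncated, not rounded, which is enough for a quick check.
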
3. Disk Full
In short: check filesystem space, inodes, and deleted files together.
# Show filesystem space usage.
df -h
# Show inode usage for filesystems.
df -ih
# Find large first-level directories under /var.
du -h --max-depth=1 /var | sort -h
# Find files larger than 1 GiB under /var.
find /var -type f -size +1G -ls
# Find deleted files that are still held open by processes.
lsof +L1

Diagnosis:

| Signal | Meaning |
|---|---|
| `df -h` 100% | filesystem full |
| `df -ih` 100% | too many small files |
| `lsof +L1` has large files | deleted file still held by process |
| `/var/log` huge | log rotation / retention issue |
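
Because space and inodes need to be checked together, a single filter over `df -P` output can serve both. This is a rough sketch: `full_filesystems` is a hypothetical helper name, the threshold is whatever the caller passes in, and mount points containing spaces are not handled.

```shell
# Hypothetical helper: read `df -P` (or `df -Pi`) output on stdin and print
# "MOUNTPOINT USE%" for every filesystem at or above the given percentage.
# Assumes the POSIX column layout, where column 5 is the usage percentage
# and column 6 is the mount point.
full_filesystems() {
    awk -v limit="$1" '
        NR > 1 {
            pct = $5
            sub(/%/, "", pct)
            if (pct + 0 >= limit + 0) print $6, $5
        }
    '
}
```

Possible usage: `df -P | full_filesystems 90` for space and, assuming GNU `df`, `df -Pi | full_filesystems 90` for inodes.
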
4. Disk IO High
In short: first check whether the disk is busy, then find which process is reading or writing.
# Show disk utilization, queueing, and latency every second.
iostat -xz 1
# Show only processes actually doing disk IO, with accumulated totals.
iotop -oPa
# Show per-process disk read/write activity every second.
pidstat -d 1
# Show block devices, filesystems, and mount targets.
lsblk -f

Diagnosis:

| Signal | Meaning |
|---|---|
| `%util` near 100 | device saturated |
| `await` high | IO latency high |
| one process high write | log / batch / database write pressure |
| many random reads | cache miss or query pattern issue |
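
The "%util near 100" check can be sketched as a filter over `iostat -x` output. Column positions differ between sysstat versions, so this sketch locates `%util` by its header name instead of a fixed index. `saturated_devices` is a hypothetical helper name and the threshold is an assumption supplied by the caller.

```shell
# Hypothetical helper: read `iostat -x` output on stdin and print
# "DEVICE %util" for every device at or above the given utilization.
# The %util column is found from the "Device" header line, and the
# column index is reset at each blank line that ends a device table.
saturated_devices() {
    awk -v limit="$1" '
        NF == 0        { col = 0; next }
        $1 ~ /^Device/ { for (i = 1; i <= NF; i++) if ($i == "%util") col = i; next }
        col && NF >= col && $col + 0 >= limit + 0 { print $1, $col }
    '
}
```

Possible usage: `iostat -xz 1 2 | saturated_devices 90`. The first `iostat` report covers the time since boot, so the second report is the one that reflects current load. Locales that print decimal commas would need `LC_ALL=C`.
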
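5. Port Not Reachable
In short: check the listener on the local host, the connection from the remote side, and routes, firewalls, and security groups in between.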
# Show listening TCP ports and owning processes.
ss -lntp
# Find which process is using a specific port.
lsof -i :<port>
# Test whether the service works from the same host.
curl -v http://127.0.0.1:<port>/
# Test whether the service works through the target host or IP.
curl -v http://<host>:<port>/
# Show routing table and default gateway.
ip route
# Capture packets for the target port to see whether traffic reaches the host.
tcpdump -i <iface> port <port>

Diagnosis:

| Signal | Meaning |
|---|---|
| no listener | service did not bind the port |
| listener on `127.0.0.1` only | remote host cannot access it |
| SYN no reply | firewall / route / security group issue |
| connection reset | app or proxy actively rejected it |
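
The first two rows of the table (no listener, loopback-only listener) can be told apart mechanically from `ss -lnt` output. The following is a sketch under assumptions: `listener_scope` is a hypothetical helper name, and it assumes the local address is the fourth column, as in the default `ss -lnt` layout.

```shell
# Hypothetical helper: read `ss -lnt` output on stdin and report how the
# given port is bound: not at all, on loopback only, or on another address.
# Handles IPv4 (127.0.0.1:80), wildcard (*:80, 0.0.0.0:80) and bracketed
# IPv6 ([::1]:80, [::]:80) forms of the local address column.
listener_scope() {
    awk -v port="$1" '
        NR > 1 {
            addr = $4
            n = split(addr, parts, ":")
            if (parts[n] != port) next
            host = substr(addr, 1, length(addr) - length(parts[n]) - 1)
            found = 1
            if (!(host ~ /^127\./ || host == "[::1]")) external = 1
        }
        END {
            if (!found)        print "no listener"
            else if (external) print "listening on a non-loopback address"
            else               print "loopback only"
        }
    '
}
```

Possible usage: `ss -lnt | listener_scope <port>`. A "loopback only" result matches the case where local `curl` to `127.0.0.1` works but remote hosts cannot connect.
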
6. Service Failed
In short: when a systemd service fails, check status first, then the journal, then the unit file and environment variables.
# Show service state, exit code, recent logs, and restart status.
systemctl status <service>
# Show recent journal logs for the service.
journalctl -u <service> -n 200
# Follow service logs in real time.
journalctl -u <service> -f
# Show the effective systemd unit and override files.
systemctl cat <service>
# Show environment variables configured in systemd for the service.
systemctl show <service> --property=Environment
# Clear systemd failed state after fixing the service.
systemctl reset-failed <service>

Diagnosis:

| Signal | Meaning |
|---|---|
| exit code non-zero | app startup failed |
| permission denied | user / file / capability issue |
| address already in use | port conflict |
| restart loop | dependency, config, or health check issue |
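
Two of the signals in the table are plain log strings, so a first pass over the journal can be scripted. This is only a sketch: `triage_journal` is a hypothetical helper name, and the two patterns are taken from the table above, not from any complete list of failure messages.

```shell
# Hypothetical helper: read service log text on stdin and print the table's
# interpretation for each known pattern that appears at least once.
# Matching is case-insensitive.
triage_journal() {
    awk '
        tolower($0) ~ /permission denied/      { perm = 1 }
        tolower($0) ~ /address already in use/ { port = 1 }
        END {
            if (perm) print "permission denied: user / file / capability issue"
            if (port) print "address already in use: port conflict"
            if (!perm && !port) print "no known pattern matched"
        }
    '
}
```

Possible usage: `journalctl -u <service> -n 200 --no-pager | triage_journal`. A "no known pattern matched" result means only that these two strings were absent; the exit code and restart behavior in `systemctl status` still need to be read by hand.
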
7. DNS / TLS Failed
In short: for DNS, check the resolution path first; for TLS, check the certificate, SNI, and system time first.
# Check whether the current machine's default DNS resolver can resolve the domain.
dig <domain>
# Bypass the local resolver and compare the answer with public DNS.
dig @8.8.8.8 <domain>
# Check the full HTTP/TLS path: DNS, TCP connect, TLS handshake, cert validation, and HTTP status.
curl -v https://<domain>/
# Check TLS handshake, certificate chain, SAN/CN, expiry, and SNI behavior.
openssl s_client -connect <host>:443 -servername <domain>
# Check whether system time is obviously wrong.
date
# Check timezone, NTP status, and whether system time is synchronized.
timedatectl

How time relates to TLS:

| Check | Why it matters |
|---|---|
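| system time too early | The certificate may be judged not yet valid. |
| system time too late | The certificate may be judged expired. |
| NTP disabled | The clock drifts slowly and intermittently breaks TLS, JWT, signed requests, and log timelines. |
| timezone wrong | Usually does not affect TLS validation, but it does affect log investigation and alert timing. |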
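
The first two rows come down to comparing the current time against the certificate's validity window. The sketch below shows that comparison on epoch seconds. `cert_time_status` is a hypothetical helper name, and how the three timestamps are obtained is left to the caller.

```shell
# Hypothetical helper: compare "now" with a certificate's validity window.
# All three arguments are Unix epoch seconds: now, notBefore, notAfter.
cert_time_status() {
    now=$1 not_before=$2 not_after=$3
    if [ "$now" -lt "$not_before" ]; then
        echo "not yet valid"
    elif [ "$now" -gt "$not_after" ]; then
        echo "expired"
    else
        echo "within validity window"
    fi
}
```

One possible way to obtain the inputs, assuming GNU `date`: pipe the `openssl s_client` output into `openssl x509 -noout -startdate -enddate`, strip the `notBefore=` and `notAfter=` prefixes, and convert each value with `date -d "<value>" +%s`. Running the check with a known-good epoch from another machine in place of `date +%s` helps separate a wrong clock from a genuinely expired certificate.
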

Diagnosis:

| Signal | Meaning |
|---|---|
| local DNS fails only | resolver or /etc/resolv.conf issue |
| public DNS fails | domain / record issue |
| cert name mismatch | wrong certificate or missing SNI |
| cert expired | certificate renewal issue |
| system time wrong | TLS validation may fail |
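
The first two DNS rows can be sketched as a comparison of the two `dig` answers. `classify_dns` is a hypothetical helper name. It assumes the inputs come from `dig +short`, and it drops lines starting with `;` so that diagnostic messages are not counted as answers.

```shell
# Hypothetical helper: given the local resolver's answer and the public
# resolver's answer (both as text, possibly empty), print which row of the
# table applies. Lines beginning with ";" and blank lines are ignored.
classify_dns() {
    local_answer=$(printf '%s\n' "$1" | sed -e '/^;/d' -e '/^$/d')
    public_answer=$(printf '%s\n' "$2" | sed -e '/^;/d' -e '/^$/d')
    if [ -z "$public_answer" ]; then
        echo "public DNS fails: domain / record issue"
    elif [ -z "$local_answer" ]; then
        echo "local DNS fails only: resolver or /etc/resolv.conf issue"
    else
        echo "both resolvers answered"
    fi
}
```

Possible usage: `classify_dns "$(dig +short <domain>)" "$(dig +short @8.8.8.8 <domain>)"`. An empty public answer can also mean that outbound DNS to 8.8.8.8 is blocked from this host, so that case is worth confirming from a second network before blaming the record.
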