Commands


1. Process

In short: first find out what is running, which resources it holds, and which ports it listens on.

Need                    Command
CPU top processes       ps aux --sort=-%cpu | head
Memory top processes    ps aux --sort=-%mem | head
Process tree            pstree -ap
Process command         ps -fp <pid>
Process files           lsof -p <pid>
Process cwd             readlink -f /proc/<pid>/cwd
Process env             tr '\0' '\n' < /proc/<pid>/environ
Threads                 ps -T -p <pid>
Kill gracefully         kill -TERM <pid>
Kill forcefully         kill -KILL <pid>
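
The two kill rows are typically used as a pair: send TERM, give the process time to clean up, and escalate to KILL only if it is still alive. A minimal sketch of that pattern follows; the name `stop_gracefully` and its 10-second default grace period are illustrative assumptions, not a standard tool.

```shell
# stop_gracefully is a hypothetical helper: TERM first, KILL only after a grace period.
# Usage: stop_gracefully PID [GRACE_SECONDS]
stop_gracefully() {
    pid=$1
    grace=${2:-10}   # seconds to wait before escalating (assumed default)

    # If TERM cannot be delivered, the process is already gone (or not ours to signal).
    kill -TERM "$pid" 2>/dev/null || return 0

    i=0
    while [ "$i" -lt "$grace" ]; do
        # kill -0 sends no signal; it only tests whether the pid still exists.
        kill -0 "$pid" 2>/dev/null || return 0
        sleep 1
        i=$((i + 1))
    done

    # Still alive after the grace period: force it.
    kill -KILL "$pid" 2>/dev/null || true
}
```

For example, `stop_gracefully <pid> 30` would wait up to 30 seconds before escalating. A terminated child that its parent has not reaped yet (a zombie) still answers `kill -0`, so the wait may run to the full grace period in that case.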

2. Memory

In short: first decide whether this is real memory pressure or just page cache making usage look high.

Need                    Command
Memory summary          free -h
Memory trend            vmstat 1
Top memory processes    ps aux --sort=-%mem | head -20
OOM logs                dmesg -T | grep -i -E 'oom|killed process'
Swap usage              swapon --show
Per process memory      pmap -x <pid> | tail -1
cgroup memory           cat /sys/fs/cgroup/memory.current
slab summary            slabtop
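
One way to separate real pressure from page cache is to look at `MemAvailable` rather than "free" or "used": the kernel's estimate already counts reclaimable cache as available. The sketch below computes it as a percentage of `MemTotal`; the helper name `mem_available_pct` is hypothetical, and it assumes a kernel recent enough to expose `MemAvailable` in `/proc/meminfo`.

```shell
# mem_available_pct is a hypothetical helper name.
# Prints MemAvailable as an integer percentage of MemTotal.
# $1: meminfo-format file to read (defaults to /proc/meminfo).
mem_available_pct() {
    awk '
        $1 == "MemTotal:"     { total = $2 }
        $1 == "MemAvailable:" { avail = $2 }
        END {
            # Fail instead of printing a misleading number if a field is missing.
            if (total > 0 && avail != "") printf "%d\n", avail * 100 / total
            else exit 1
        }
    ' "${1:-/proc/meminfo}"
}
```

`free -h` shows the same estimate in its `available` column. A low percentage points to real pressure; high "used" with plenty still available usually just means the page cache is doing its job.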

3. Disk

In short: check disk problems in three categories: space, inodes, and IO.

Need                    Command
Filesystem usage        df -h
Inode usage             df -ih
Directory size          du -h --max-depth=1 <path> | sort -h
Large files             find <path> -type f -size +1G -ls
Deleted open files      lsof +L1
Block devices           lsblk -f
Mounts                  findmnt
Disk IO                 iostat -xz 1
Per process IO          iotop -oPa
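
The space and inode checks share the same `df` layout, so one filter can cover both. The sketch below reads `df -P` style output on stdin and prints the mount points at or above a usage threshold; `over_threshold` and the 90% default are illustrative assumptions, and mount points containing spaces are not handled.

```shell
# over_threshold is a hypothetical helper name.
# Reads `df -P` style output on stdin and prints "<mount> <use%>"
# for every filesystem whose usage is at or above the limit.
# $1: limit in percent (assumed default: 90).
over_threshold() {
    limit=${1:-90}
    awk -v limit="$limit" '
        NR > 1 {                      # skip the header line
            pct = $5
            sub(/%$/, "", pct)        # "95%" -> "95"
            # Ignore rows whose usage column is not a number (e.g. "-").
            if (pct ~ /^[0-9]+$/ && pct + 0 >= limit + 0) print $6, pct "%"
        }
    '
}
```

With GNU `df`, `df -P | over_threshold 90` would cover space and `df -Pi | over_threshold 90` would cover inodes. IO still needs `iostat` or `iotop`, since it is a rate rather than a fill level.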

4. Network

In short: confirm the listener first, then the connection, then DNS / route / firewall.

Need                        Command
Listening TCP ports         ss -lntp
Listening UDP ports         ss -lnup
Active connections          ss -antp
Process by port             lsof -i :<port>
DNS lookup                  dig <domain>
DNS with public resolver    dig @8.8.8.8 <domain>
HTTP check                  curl -v http://<host>:<port>/
TLS check                   openssl s_client -connect <host>:443 -servername <domain>
Route                       ip route
Address                     ip addr
Capture packets             tcpdump -i <iface> host <ip> and port <port>
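
The first step, confirming that something is actually listening, is easy to script so it can gate the later checks. The sketch below scans `ss -lnt` style output for a local port; `port_is_listening` is a hypothetical name, and it assumes the usual `ss -lnt` column layout in which the local address is the fourth field.

```shell
# port_is_listening is a hypothetical helper name.
# Reads `ss -lnt` style output on stdin; exits 0 if PORT appears as a
# local listening port, 1 otherwise.
# Assumption: the local "address:port" is the 4th column.
port_is_listening() {
    awk -v port="$1" '
        NR > 1 {                          # skip the header line
            # Split on ":" and take the last piece, so that both
            # "0.0.0.0:22" and "[::]:8080" yield the port number.
            n = split($4, parts, ":")
            if (parts[n] == port) found = 1
        }
        END { exit (found ? 0 : 1) }
    '
}
```

A possible use is `ss -lnt | port_is_listening 8080 || echo "nothing listening on 8080"`. Only once that passes is it worth moving on to connections, DNS, routes, and firewall rules.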

5. Logs

In short: for systemd services, check the journal first; for traditional applications, look in /var/log.

Need                    Command
Service logs            journalctl -u <service> -n 200
Follow service logs     journalctl -u <service> -f
Boot logs               journalctl -b
Error logs              journalctl -p err -n 100
Kernel logs             dmesg -T
Follow file             tail -f <file>
Search gzip logs        zgrep '<pattern>' <file>.gz
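
For traditional applications, the current log and its rotated, compressed copies often need to be searched together. The sketch below does that with `gzip -dc` and `grep` and prefixes each match with its file name; `search_logs` is a hypothetical helper, and the `.gz` suffix test is an assumption about how rotation names the files.

```shell
# search_logs is a hypothetical helper name.
# Usage: search_logs PATTERN FILE...
# Searches plain and gzip-compressed logs, prefixing each match with its file.
search_logs() {
    pattern=$1
    shift
    for f in "$@"; do
        case $f in
            *.gz) gzip -dc "$f" ;;   # decompress rotated logs to stdout
            *)    cat "$f" ;;
        esac |
            grep -e "$pattern" |
            awk -v f="$f" '{ print f ": " $0 }'
    done
}
```

A call might look like `search_logs 'timeout' <file> <file>.gz`, with the files listed in whatever order the output should appear.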

6. System

In short: for system state, check uptime, load, kernel, and resource limits first.

Need                    Command
Uptime / load           uptime
Kernel                  uname -a
OS release              cat /etc/os-release
CPU info                lscpu
Current time            date
Time sync status        timedatectl
Limits                  ulimit -a
Open files limit        cat /proc/<pid>/limits
Timers                  systemctl list-timers
Failed units            systemctl --failed
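
A load figure only means something relative to the number of CPUs, so it can help to normalize it. The sketch below divides the 1-minute load average by the CPU count; `load_per_cpu` is a hypothetical helper, and reading `/proc/loadavg` and calling `nproc` are assumptions about a typical Linux host rather than part of the table above.

```shell
# load_per_cpu is a hypothetical helper name.
# Prints the 1-minute load average divided by the CPU count.
# $1: CPU count (defaults to the output of nproc).
# $2: loadavg-format file (defaults to /proc/loadavg).
load_per_cpu() {
    ncpu=${1:-$(nproc)}
    awk -v n="$ncpu" '
        {
            if (n > 0) printf "%.2f\n", $1 / n   # $1 is the 1-minute average
            else exit 1
        }
    ' "${2:-/proc/loadavg}"
}
```

As a rough reading, values well above 1.00 suggest more runnable or IO-blocked tasks than CPUs, which is a cue to look at the Process and Disk sections next.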