1. Process

In short: first find out what is running, which resources it holds, and which ports it listens on.

| Need | Command |
|---|---|
| CPU top processes | `ps aux --sort=-%cpu \| head` |
| Memory top processes | `ps aux --sort=-%mem \| head` |
| Process tree | `pstree -ap` |
| Process command | `ps -fp <pid>` |
| Process files | `lsof -p <pid>` |
| Process cwd | `readlink -f /proc/<pid>/cwd` |
| Process env | `tr '\0' '\n' < /proc/<pid>/environ` |
| Threads | `ps -T -p <pid>` |
| Kill gracefully | `kill -TERM <pid>` |
| Kill forcefully | `kill -KILL <pid>` |
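
The last two rows are usually used together: ask the process to exit with SIGTERM, and fall back to SIGKILL only if it is still there after a grace period. The following is a minimal sketch for a POSIX shell; the helper name `stop_pid` and the 5-second default are assumptions for illustration, not part of any standard tool.

```shell
# stop_pid <pid> [grace_seconds]  (hypothetical helper, not a standard tool)
# Sends SIGTERM, polls once per second, and sends SIGKILL only if the
# process is still present after the grace period.
stop_pid() {
  target=$1
  grace=${2:-5}            # assumed default; tune per service
  # kill fails if the process does not exist or we may not signal it
  kill -TERM "$target" 2>/dev/null || return 1
  waited=0
  while [ "$waited" -lt "$grace" ]; do
    # kill -0 sends no signal; it only checks that the PID can be signalled
    kill -0 "$target" 2>/dev/null || return 0
    sleep 1
    waited=$((waited + 1))
  done
  kill -KILL "$target" 2>/dev/null || true
}
```

For example, `stop_pid 12345 10` would give PID 12345 up to ten seconds to shut down cleanly before it is killed.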

2. Memory

In short: first determine whether this is real memory pressure or just the page cache making usage look high.

| Need | Command |
|---|---|
| Memory summary | `free -h` |
| Memory trend | `vmstat 1` |
| Top memory processes | `ps aux --sort=-%mem \| head -20` |
| OOM logs | `dmesg -T \| grep -i -E 'oom\|killed process'` |
| Swap usage | `swapon --show` |
| Per process memory | `pmap -x <pid> \| tail -1` |
| cgroup memory | `cat /sys/fs/cgroup/memory.current` |
| slab summary | `slabtop` |
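
One way to separate real pressure from a large page cache is to compare `MemAvailable` with `MemTotal` in `/proc/meminfo`: `MemAvailable` is the kernel's estimate of memory that new workloads can use without swapping, so reclaimable cache is already counted as available. A minimal sketch follows; the helper name `mem_available_pct` is hypothetical, and what counts as "too low" depends on the workload.

```shell
# mem_available_pct [meminfo_file]  (hypothetical helper)
# Prints MemAvailable as an integer percentage of MemTotal.
# Defaults to /proc/meminfo; a file argument is accepted so the
# function can be tried against a saved copy.
mem_available_pct() {
  awk '
    /^MemTotal:/     { total = $2 }
    /^MemAvailable:/ { avail = $2 }
    END {
      if (total > 0) printf "%d\n", avail * 100 / total
      else exit 1      # no MemTotal line found
    }
  ' "${1:-/proc/meminfo}"
}
```

If `free -h` shows little "free" memory but this percentage is still comfortable, the usage is probably mostly cache rather than pressure.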

3. Disk

In short: check disk problems in three categories: space, inodes, and I/O.

| Need | Command |
|---|---|
| Filesystem usage | `df -h` |
| Inode usage | `df -ih` |
| Directory size | `du -h --max-depth=1 <path> \| sort -h` |
| Large files | `find <path> -type f -size +1G -ls` |
| Deleted open files | `lsof +L1` |
| Block devices | `lsblk -f` |
| Mounts | `findmnt` |
| Disk IO | `iostat -xz 1` |
| Per process IO | `iotop -oPa` |
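
Running out of blocks and running out of inodes both show up as "No space left on device", so it can help to run the same threshold check against both views. The sketch below filters `df -P` style output read from stdin; the helper name `fs_over` is hypothetical, and it assumes mount points without spaces.

```shell
# fs_over <percent>  (hypothetical helper)
# Reads `df -P` (or `df -Pi`) output on stdin and prints the mount point
# and usage of every filesystem at or above the given percentage.
# Assumes mount points contain no whitespace (field 6 is the mount point).
fs_over() {
  awk -v limit="$1" '
    NR > 1 {
      use = $5
      sub(/%/, "", use)          # "95%" -> "95"; "-" becomes 0 below
      if (use + 0 >= limit) print $6, $5
    }
  '
}
```

Typical use would be `df -P | fs_over 90` for space and `df -Pi | fs_over 90` for inodes.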

4. Network

In short: confirm the listener first, then the connection, then DNS, routing, and the firewall.

| Need | Command |
|---|---|
| Listening TCP ports | `ss -lntp` |
| Listening UDP ports | `ss -lnup` |
| Active connections | `ss -antp` |
| Process by port | `lsof -i :<port>` |
| DNS lookup | `dig <domain>` |
| DNS with public resolver | `dig @8.8.8.8 <domain>` |
| HTTP check | `curl -v http://<host>:<port>/` |
| TLS check | `openssl s_client -connect <host>:443 -servername <domain>` |
| Route | `ip route` |
| Address | `ip addr` |
| Capture packets | `tcpdump -i <iface> host <ip> and port <port>` |
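
The first step, confirming that something is actually listening, can be scripted by parsing `ss -lnt` output. The sketch below is one possible approach; the helper name `port_listening` is hypothetical, and it assumes the usual column layout in which the fourth field is `Local Address:Port`.

```shell
# port_listening <port>  (hypothetical helper)
# Reads `ss -lnt` output on stdin and exits 0 if any socket listens on
# the given port, 1 otherwise. The port is taken as the text after the
# last ":" of field 4, so 0.0.0.0:22, *:22 and [::]:22 all match 22.
port_listening() {
  awk -v port="$1" '
    NR > 1 {
      n = split($4, parts, ":")
      if (parts[n] == port) found = 1
    }
    END { exit !found }
  '
}
```

For example, `ss -lnt | port_listening 8080 && echo "8080 is listening"`. If the port is listening but clients still fail, the later rows (connections, DNS, route) are the next things to check.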

5. Logs

In short: for systemd services, check the journal first; for traditional applications, check /var/log.

| Need | Command |
|---|---|
| Service logs | `journalctl -u <service> -n 200` |
| Follow service logs | `journalctl -u <service> -f` |
| Boot logs | `journalctl -b` |
| Error logs | `journalctl -p err -n 100` |
| Kernel logs | `dmesg -T` |
| Follow file | `tail -f <file>` |
| Search gzip logs | `zgrep '<pattern>' <file>.gz` |
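
For file-based logs, the current file is usually plain text while rotated copies are gzip-compressed, so a search often has to cover both. The following sketch picks the reader by file extension; the helper name `search_logs` is hypothetical, and it assumes compressed files end in `.gz`.

```shell
# search_logs <pattern> <file>...  (hypothetical helper)
# Greps each file for the pattern, decompressing *.gz files on the fly,
# and prefixes every matching line with the file it came from.
search_logs() {
  pattern=$1
  shift
  for f in "$@"; do
    case $f in
      *.gz) gzip -cd -- "$f" ;;   # rotated, compressed log
      *)    cat -- "$f" ;;        # current, plain-text log
    esac | grep -e "$pattern" | while IFS= read -r line; do
      printf '%s: %s\n' "$f" "$line"
    done
  done
}
```

A call might look like `search_logs 'timeout' /var/log/<app>/app.log /var/log/<app>/app.log.1.gz`, where the paths are placeholders for the application's real log files.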

6. System

In short: for system state, look at uptime, load, kernel, and resource limits first.

| Need | Command |
|---|---|
| Uptime / load | `uptime` |
| Kernel | `uname -a` |
| OS release | `cat /etc/os-release` |
| CPU info | `lscpu` |
| Current time | `date` |
| Time sync status | `timedatectl` |
| Limits | `ulimit -a` |
| Open files limit | `cat /proc/<pid>/limits` |
| Timers | `systemctl list-timers` |
| Failed units | `systemctl --failed` |
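
The load figures from `uptime` are easier to read once divided by the number of CPUs. A minimal sketch, assuming Linux's `/proc/loadavg` and GNU coreutils' `nproc`; the helper name `load_per_cpu` is hypothetical.

```shell
# load_per_cpu [loadavg_file] [cpu_count]  (hypothetical helper)
# Prints the 1-minute load average divided by the CPU count.
# Defaults: /proc/loadavg and the output of nproc.
load_per_cpu() {
  cpus=${2:-$(nproc)}
  awk -v cpus="$cpus" '{ printf "%.2f\n", $1 / cpus }' "${1:-/proc/loadavg}"
}
```

A value well above 1 suggests more runnable work than CPUs, but on Linux the load average also counts tasks in uninterruptible sleep, so a high value can point at I/O rather than CPU.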