Links
https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/viewing_metrics_with_cloudwatch.html
https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/Install-CloudWatch-Agent.html
https://prometheus.io/docs/prometheus/latest/configuration/alerting_rules/

1. Alert Standard
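| Severity | Alert | Meaning | Why Monitor | Definition | Duration |
|---|---|---|---|---|---|
| P0 | EC2 status check failed | The EC2 system or instance status check has failed | Indicates that the underlying instance is unavailable or has a network/system fault; needs a backstop alert | StatusCheckFailed > 0 | 2m |
| P0 | Monitoring host down | node exporter on the monitoring host cannot be scraped | If a core node of the monitoring pipeline is unavailable, downstream alerts stop working | up{job="node"} == 0 | 2m |
| P1 | Disk almost full | Filesystem usage is close to full | A full disk causes write failures, service errors, and log loss | filesystem used >= 90% | 10m |
| P1 | Memory almost full | Memory usage is close to full | Guards against OOM, swap thrashing, and killed processes | memory used >= 90% | 10m |
| P1 | Alertmanager down | Alertmanager is unavailable | Alert routing and notifications may be interrupted; restore as soon as possible | up{job="alertmanager"} == 0 | 2m |
| P1 | VictoriaMetrics down | VictoriaMetrics is unavailable | Interrupted metric storage and queries affect monitoring and alerting | up{job="victoriametrics"} == 0 | 2m |
| P2 | CPU high | Instance CPU usage stays high | Reveals insufficient capacity, runaway processes, or rising load | CPUUtilization >= 80% | 15m |
| P2 | EBS burst balance low | EBS burst credits are running low | Exhausted credits on gp2/st1/sc1 volumes degrade I/O performance | BurstBalance < 20% | 15m |

P0/P1 alerts for the monitoring platform EC2 instance must keep a CloudWatch Alarm as a backstop and must not depend on the local Alertmanager.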
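
As an illustration of the CloudWatch backstop requirement, the fragment below sketches the P0 status-check row from the table (StatusCheckFailed > 0 for 2m) as a CloudFormation alarm. It is a minimal sketch, not a definitive template: the resource name `MonitoringEc2StatusCheckAlarm`, the `AlertTopicArn` parameter, and the `TreatMissingData` choice are assumptions that do not come from this document, and the instance ID is a placeholder.

```yaml
# Hypothetical CloudFormation fragment; resource and parameter names are illustrative.
Parameters:
  AlertTopicArn:
    Type: String
    Description: ARN of an existing SNS topic that receives alarm notifications (assumed to exist)

Resources:
  MonitoringEc2StatusCheckAlarm:
    Type: AWS::CloudWatch::Alarm
    Properties:
      AlarmDescription: "P0: EC2 status check failed on the monitoring host"
      Namespace: AWS/EC2
      MetricName: StatusCheckFailed
      Dimensions:
        - Name: InstanceId
          Value: i-0123456789abcdef0   # placeholder instance ID
      Statistic: Maximum
      Period: 60                       # seconds
      EvaluationPeriods: 2             # 2 x 60s matches the 2m duration in the table
      DatapointsToAlarm: 2
      Threshold: 0
      ComparisonOperator: GreaterThanThreshold
      # Assumption: treat missing datapoints as failures, since a dead host may stop reporting.
      TreatMissingData: breaching
      AlarmActions:
        - !Ref AlertTopicArn
```

Because the alarm is evaluated by CloudWatch and notifies through SNS, it would keep firing even when the local Alertmanager or VictoriaMetrics on the monitoring host is down.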
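2. CloudWatch Metrics

| Metric | Meaning | Why Monitor | Namespace | Statistic | Period |
|---|---|---|---|---|---|
| StatusCheckFailed | The EC2 instance or system status check has failed | AWS-level view of instance health; well suited as a monitoring backstop | AWS/EC2 | Maximum | 60s |
| CPUUtilization | EC2 CPU utilization | Identifies compute bottlenecks and abnormal load | AWS/EC2 | Average | 60s |
| disk_used_percent | Disk usage reported by the CloudWatch Agent | Prevents a full disk from breaking service and log writes | CWAgent | Average | 60s |
| mem_used_percent | Memory usage reported by the CloudWatch Agent | Indicates memory pressure and OOM risk | CWAgent | Average | 60s |
| BurstBalance | Percentage of burst credits remaining on an EBS volume | Prevents I/O degradation after burst credits run out | AWS/EBS | Average | 60s |

Status check query: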

    [
      {
        "Id": "status_failed",
        "MetricStat": {
          "Metric": {
            "Namespace": "AWS/EC2",
            "MetricName": "StatusCheckFailed",
            "Dimensions": [{ "Name": "InstanceId", "Value": "i-0123456789abcdef0" }]
          },
          "Period": 60,
          "Stat": "Maximum"
        },
        "ReturnData": true
      }
    ]

CloudWatch Alarm condition:

    metric: StatusCheckFailed
    comparison: GreaterThanThreshold
    threshold: 0
    evaluation_periods: 2
    datapoints_to_alarm: 2
    period: 60
    severity: P0

3. PromQL
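
    # node exporter cannot be scraped by Prometheus/VictoriaMetrics.
    # For the monitoring platform EC2 instance, this means the primary monitoring pipeline is already down.
    up{job="node",instance="monitoring-ec2:9100"} == 0

    # Root filesystem usage:
    # 1 - available / total = fraction used
    # multiplied by 100 = percentage
    # tmpfs and overlay are excluded so temporary filesystems and container layers do not interfere
    # >= 90 means the disk is almost full.
    100 *
      (1 - node_filesystem_avail_bytes{mountpoint="/",fstype!~"tmpfs|overlay"} /
           node_filesystem_size_bytes{mountpoint="/",fstype!~"tmpfs|overlay"})
      >= 90

    # Memory usage:
    # MemAvailable is the memory Linux estimates can still be allocated to applications
    # 1 - MemAvailable / MemTotal = a usage figure that tracks real memory pressure more closely
    # >= 90 means memory pressure is very high.
    100 *
      (1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)
      >= 90

    # CPU usage:
    # the rate of the idle mode is the fraction of time the CPU is idle
    # 100 - idle_percent = CPU usage
    # avg by (instance) aggregates all CPU cores per machine.
    100 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100 >= 80

4. vmalert Rules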

    groups:
      - name: ec2.rules
        rules:
          - alert: MonitoringEC2Down
            # node exporter is down: the monitoring platform host or the exporter is unreachable.
            expr: up{job="node"} == 0
            for: 2m
            labels:
              severity: P0
              component: ec2
            annotations:
              summary: "Monitoring EC2 node exporter is down"
          - alert: EC2DiskAlmostFull
            # Filesystem usage is at or above 90%, excluding tmpfs / overlay.
            expr: |
              100 * (1 - node_filesystem_avail_bytes{fstype!~"tmpfs|overlay"} / node_filesystem_size_bytes{fstype!~"tmpfs|overlay"}) >= 90
            for: 10m
            labels:
              severity: P1
              component: ec2
            annotations:
              summary: "EC2 disk usage is >= 90%"
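
The two rules above cover only the node exporter and disk rows of the alert standard. As a hedged sketch, the remaining Prometheus-side rows (memory, CPU, Alertmanager, VictoriaMetrics) could be written in the same style, reusing the expressions from the PromQL section and the durations from the table. The group name, alert names, and the `component` values for Alertmanager and VictoriaMetrics are hypothetical, and the `job` label values are assumed to match the scrape configuration.

```yaml
groups:
  - name: ec2.extra.rules              # hypothetical group name
    rules:
      - alert: EC2MemoryAlmostFull     # hypothetical alert name
        # Memory usage derived from MemAvailable is at or above 90%.
        expr: |
          100 * (1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) >= 90
        for: 10m
        labels:
          severity: P1
          component: ec2
        annotations:
          summary: "EC2 memory usage is >= 90%"
      - alert: EC2CPUHigh              # hypothetical alert name
        # CPU usage averaged over all cores per instance is at or above 80%.
        expr: |
          100 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100 >= 80
        for: 15m
        labels:
          severity: P2
          component: ec2
        annotations:
          summary: "EC2 CPU usage is >= 80%"
      - alert: AlertmanagerDown        # hypothetical alert name
        # Alertmanager target cannot be scraped; job label value is assumed.
        expr: up{job="alertmanager"} == 0
        for: 2m
        labels:
          severity: P1
          component: alertmanager      # hypothetical component value
        annotations:
          summary: "Alertmanager is down"
      - alert: VictoriaMetricsDown     # hypothetical alert name
        # VictoriaMetrics target cannot be scraped; job label value is assumed.
        expr: up{job="victoriametrics"} == 0
        for: 2m
        labels:
          severity: P1
          component: victoriametrics   # hypothetical component value
        annotations:
          summary: "VictoriaMetrics is down"
```

Note that a rule such as `AlertmanagerDown` may never be delivered when Alertmanager itself is the failed component, which is why the alert standard keeps a CloudWatch Alarm as the backstop for P0/P1.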