Links#
https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/rds-metrics.html
https://docs.aws.amazon.com/AmazonRDS/latest/AuroraUserGuide/Aurora.AuroraMonitoring.Metrics.html
https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/using-metric-math.html
https://prometheus.io/docs/prometheus/latest/configuration/alerting_rules/
1. Alert Standard#
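| Severity | Alert | Meaning | Why Monitor | Definition | Duration |
|---|---|---|---|---|---|
| P1 | CPU critical | DB instance CPU stays near saturation for an extended period | CPU saturation amplifies query latency and slows connection handling | CPUUtilization >= 90% | 10m |
| P1 | FreeableMemory low | Freeable memory is too low | A memory shortage degrades cache efficiency and can cause swapping or process instability | FreeableMemory < 1GiB or < 10% of instance memory | 10m |
| P1 | Connections near limit | Database connection count is close to the limit | Connection exhaustion makes new requests fail and blocks application connection pools | DatabaseConnections >= 80% of max_connections | 10m |
| P1 | ReadLatency high | Read I/O latency is significantly elevated | Hurts query performance; commonly caused by storage pressure or slow queries | ReadLatency >= 0.02s and > 2 * 1h average | 10m |
| P1 | WriteLatency high | Write I/O latency is significantly elevated | Hurts transaction commits and write throughput, and can cause application timeouts | WriteLatency >= 0.02s and > 2 * 1h average | 10m |
| P1 | Replica lag serious | Aurora replica replication lag is severe | Read replicas may return stale data, and failover risk increases | AuroraReplicaLagMaximum >= 120s | 5m |
| P1 | FreeStorageSpace low | Free storage space is too low | Running out of storage affects writes and database stability | FreeStorageSpace < 10GiB | 10m |
| P2 | Disk queue high | The disk I/O queue keeps building up | Storage cannot keep up, which usually pushes read and write latency higher | DiskQueueDepth >= 10 | 10m |
| P2 | Deadlocks | Deadlocks are occurring in the database | Points to problems in transaction design or concurrent writes, and causes request failures | Deadlocks > 0 | 5m |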
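The Definition and Duration columns work together: an alert should fire only when its condition holds for the whole duration. The following is a minimal sketch of the ReadLatency/WriteLatency row, not a definitive implementation. It assumes one latency sample per minute, in seconds, ordered oldest first; the function name `latency_alert_firing` and the constants are hypothetical and not part of any AWS or Prometheus API.

```python
from statistics import mean

ABS_THRESHOLD_S = 0.02   # 20 ms absolute floor, from the Definition column
BASELINE_FACTOR = 2      # "> 2 * 1h average"
BASELINE_SAMPLES = 60    # 1h of samples, assuming one sample per minute
FOR_SAMPLES = 10         # 10m duration, assuming one sample per minute


def latency_alert_firing(samples):
    """Return True if each of the last FOR_SAMPLES points breaches both conditions.

    samples: latency values in seconds, one per minute, oldest first.
    """
    if len(samples) < FOR_SAMPLES:
        return False
    for i in range(len(samples) - FOR_SAMPLES, len(samples)):
        # Trailing window that ends at (and includes) the current sample.
        window = samples[max(0, i - BASELINE_SAMPLES + 1): i + 1]
        baseline = mean(window)
        value = samples[i]
        breached = value >= ABS_THRESHOLD_S and value > BASELINE_FACTOR * baseline
        if not breached:
            return False
    return True


if __name__ == "__main__":
    spike = [0.005] * 50 + [0.05] * 10   # 10 minutes at 10x the usual latency
    steady = [0.05] * 70                 # always slow, so no relative jump
    print(latency_alert_firing(spike), latency_alert_firing(steady))
```

The steady case shows why the relative condition matters: a latency that is always 50 ms never exceeds twice its own one-hour average, so only the absolute floor is breached and the compound rule stays quiet.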
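If the instance memory is known, prefer a ratio:
FreeableMemory / instance_memory_bytes < 0.10
If the instance memory is unknown, use the production default:
FreeableMemory < 1073741824 bytes
for 10m
The DatabaseConnections limit must be queried from the database; do not guess it:
show max_connections;
select count(*) from pg_stat_activity;
2. CloudWatch Metrics#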
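| Metric | Meaning | Why Monitor | Namespace | Statistic | Period |
|---|---|---|---|---|---|
| CPUUtilization | DB instance CPU utilization | Shows whether query load or instance size is approaching a bottleneck | AWS/RDS | Average | 60s |
| FreeableMemory | Freeable memory on the DB instance | Indicates cache pressure, memory leaks, or an undersized instance | AWS/RDS | Average | 60s |
| DatabaseConnections | Current number of database connections | Indicates connection pool issues, connection leaks, and max_connections risk | AWS/RDS | Average | 60s |
| ReadLatency | Average read I/O latency | Detects storage read pressure and query performance regressions | AWS/RDS | Average | 60s |
| WriteLatency | Average write I/O latency | Detects transaction commit, write, and storage performance problems | AWS/RDS | Average | 60s |
| DiskQueueDepth | Depth of the I/O queue waiting on disk | Shows whether the storage layer has become the bottleneck | AWS/RDS | Average | 60s |
| FreeStorageSpace | Available storage space | Prevents write failures or instance faults caused by running out of space | AWS/RDS | Average | 60s |
| AuroraReplicaLagMaximum | Maximum Aurora replica replication lag | Indicates read replica freshness and failover risk | AWS/RDS | Maximum | 60s |
| Deadlocks | Number of deadlocks in the period | Detects transaction concurrency conflicts and SQL or index design problems | AWS/RDS | Sum | 300s |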
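The table above maps directly onto the `MetricDataQueries` structure that `aws cloudwatch get-metric-data` accepts. As a rough sketch, the list can be generated instead of written by hand. The function name `build_queries`, the `m0`..`m8` ids, and the choice to query AuroraReplicaLagMaximum by `DBClusterIdentifier` are assumptions for illustration; check the dimensions against what your account actually publishes.

```python
import json

# (metric name, statistic, period in seconds), copied from the table above.
METRICS = [
    ("CPUUtilization", "Average", 60),
    ("FreeableMemory", "Average", 60),
    ("DatabaseConnections", "Average", 60),
    ("ReadLatency", "Average", 60),
    ("WriteLatency", "Average", 60),
    ("DiskQueueDepth", "Average", 60),
    ("FreeStorageSpace", "Average", 60),
    ("AuroraReplicaLagMaximum", "Maximum", 60),
    ("Deadlocks", "Sum", 300),
]

# Assumption: replica lag is queried per cluster, the rest per instance.
CLUSTER_LEVEL = {"AuroraReplicaLagMaximum"}


def build_queries(instance_id, cluster_id):
    """Build a MetricDataQueries list with one MetricStat entry per table row."""
    queries = []
    for index, (name, stat, period) in enumerate(METRICS):
        if name in CLUSTER_LEVEL:
            dimension = {"Name": "DBClusterIdentifier", "Value": cluster_id}
        else:
            dimension = {"Name": "DBInstanceIdentifier", "Value": instance_id}
        queries.append({
            "Id": f"m{index}",  # ids must start with a lowercase letter
            "MetricStat": {
                "Metric": {
                    "Namespace": "AWS/RDS",
                    "MetricName": name,
                    "Dimensions": [dimension],
                },
                "Period": period,
                "Stat": stat,
            },
            "ReturnData": True,
        })
    return queries


if __name__ == "__main__":
    # Example identifiers reused from this page; replace with your own.
    print(json.dumps(build_queries("prod-aurora-writer", "prod-aurora"), indent=2))
```

Redirecting the output to a file gives a document that can be passed to `--metric-data-queries file://...` in the same way as a hand-written single-metric query.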
FreeableMemory query (saved as rds-freeable-memory.json):
[
  {
    "Id": "free_mem",
    "MetricStat": {
      "Metric": {
        "Namespace": "AWS/RDS",
        "MetricName": "FreeableMemory",
        "Dimensions": [
          { "Name": "DBInstanceIdentifier", "Value": "prod-aurora-writer" }
        ]
      },
      "Period": 60,
      "Stat": "Average"
    },
    "ReturnData": true
  }
]
aws cloudwatch get-metric-data \
  --start-time 2026-06-02T00:00:00Z \
  --end-time 2026-06-02T01:00:00Z \
  --metric-data-queries file://rds-freeable-memory.json
For a significant rise in ReadLatency, use CloudWatch anomaly detection:
metric: AWS/RDS ReadLatency Average 60s
comparison: GreaterThanUpperThreshold
threshold_metric_id: ad1
metric math: ad1 = ANOMALY_DETECTION_BAND(m1, 2)
evaluation_periods: 10
datapoints_to_alarm: 10
extra static alarm: ReadLatency >= 0.02 seconds for 10 datapoints
CloudWatch metric math for connections nearing the limit:
max_connections: 500
expression: 100 * connections / 500
threshold: 80
3. PromQL#
First confirm the YACE metric names:
# Use a regex to confirm which RDS/Aurora metric names YACE exposes.
# This is not an alert; it only confirms the metric names and label names.
{__name__=~"aws_rds_.*(cpu|freeable|connection|latency|lag|deadlock).*"}

# Aurora writer average CPU utilization is at or above 90%.
# Alert only after this holds for 10 minutes, to avoid false positives from short spikes.
aws_rds_cpu_utilization_average{dbinstance_identifier="prod-aurora-writer"} >= 90

# FreeableMemory is below 1GiB.
# If you know the instance's total memory, FreeableMemory / instance_memory < 10% is preferred.
aws_rds_freeable_memory_average{dbinstance_identifier="prod-aurora-writer"} < 1073741824

# Connection utilization:
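# current connections / max_connections * 100
# The 500 here must be replaced with the value returned by the SQL `show max_connections;`.
# >= 80 means the connection pool or application connection count is close to the limit.
100 *
aws_rds_database_connections_average{dbinstance_identifier="prod-aurora-writer"}
/
500
>= 80

ReadLatency / WriteLatency significantly elevated: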
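# ReadLatency significantly elevated:
# The first clause, >= 0.02, means average read latency is at least 20ms.
# The second clause, > 2 * 1h average, means it is more than double the last hour's mean.
# Both conditions must hold, which avoids false positives from small swings on a low baseline.
aws_rds_read_latency_average{dbinstance_identifier="prod-aurora-writer"} >= 0.02
and
aws_rds_read_latency_average{dbinstance_identifier="prod-aurora-writer"}
> 2 * avg_over_time(aws_rds_read_latency_average{dbinstance_identifier="prod-aurora-writer"}[1h])

# WriteLatency significantly elevated: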
# >= 0.02 means average write latency is at least 20ms.
# > 2 * 1h average means it is more than double the last hour's mean.
# Elevated write latency is usually related to storage, lock waits, checkpoints, or write pressure.
aws_rds_write_latency_average{dbinstance_identifier="prod-aurora-writer"} >= 0.02
and
aws_rds_write_latency_average{dbinstance_identifier="prod-aurora-writer"}
> 2 * avg_over_time(aws_rds_write_latency_average{dbinstance_identifier="prod-aurora-writer"}[1h])

# Aurora maximum replica lag is at or above 120 seconds.
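# With read/write splitting, this causes readers to return noticeably stale data.
# CloudWatch reports AuroraReplicaLagMaximum in milliseconds, so 120 seconds is 120000.
aws_rds_aurora_replica_lag_maximum_maximum{dbcluster_identifier="prod-aurora"} >= 120000
4. vmalert Rules#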
groups:
  - name: aurora-postgresql.rules
    rules:
      - alert: AuroraFreeableMemoryLow
        # RDS FreeableMemory is reported in bytes; 1073741824 = 1GiB.
        expr: aws_rds_freeable_memory_average < 1073741824
        for: 10m
        labels:
          severity: P1
          component: aurora
        annotations:
          summary: "Aurora FreeableMemory is below 1GiB"
      - alert: AuroraReadLatencyHigh
        # "Significantly elevated" means both the absolute threshold and the relative baseline are breached.
        expr: |
          aws_rds_read_latency_average >= 0.02
          and
          aws_rds_read_latency_average > 2 * avg_over_time(aws_rds_read_latency_average[1h])
        for: 10m
        labels:
          severity: P1
          component: aurora
        annotations:
          summary: "Aurora ReadLatency is high: >=20ms and >2x 1h average"
      - alert: AuroraReplicaLagSerious
        # Replication lag between reader and writer is at or above 120 seconds.
        # CloudWatch reports AuroraReplicaLagMaximum in milliseconds, so 120s = 120000.
        expr: aws_rds_aurora_replica_lag_maximum_maximum >= 120000
        for: 5m
        labels:
          severity: P1
          component: aurora
        annotations:
          summary: "Aurora replica lag is >=120s"
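
The rule group above does not include the connection-utilization alert from section 3, whose denominator has to come from `show max_connections;` on each database. One possible approach, sketched below under assumptions, is to render that expression per instance and feed the result into whatever templating produces the rules file. The function names and the alert name `AuroraConnectionsNearLimit` are hypothetical; the metric and label names are the ones used in section 3.

```python
def connection_utilization_expr(instance_id, max_connections, threshold_pct=80):
    """Render the PromQL connection-utilization expression for one instance.

    max_connections must be the value returned by `show max_connections;`.
    """
    if max_connections <= 0:
        raise ValueError("max_connections must be positive")
    selector = (
        "aws_rds_database_connections_average"
        f'{{dbinstance_identifier="{instance_id}"}}'
    )
    return f"100 * {selector} / {max_connections} >= {threshold_pct}"


def connection_rule(instance_id, max_connections):
    """Build a rule entry shaped like the ones in the group above."""
    return {
        "alert": "AuroraConnectionsNearLimit",  # hypothetical alert name
        "expr": connection_utilization_expr(instance_id, max_connections),
        "for": "10m",
        "labels": {"severity": "P1", "component": "aurora"},
        "annotations": {
            "summary": "Aurora connections are >=80% of max_connections",
        },
    }


if __name__ == "__main__":
    # 500 is only the example value used earlier on this page.
    print(connection_rule("prod-aurora-writer", 500)["expr"])
```

Keeping `max_connections` as an explicit argument makes it harder to ship the placeholder value of 500 by accident.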