AWS Aurora PostgreSQL Monitoring


https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/rds-metrics.html
https://docs.aws.amazon.com/AmazonRDS/latest/AuroraUserGuide/Aurora.AuroraMonitoring.Metrics.html
https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/using-metric-math.html
https://prometheus.io/docs/prometheus/latest/configuration/alerting_rules/

1. Alert Standard

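| Severity | Alert | Meaning | Why Monitor | Definition | Duration |
| --- | --- | --- | --- | --- | --- |
| P1 | CPU critical | DB instance CPU stays near saturation for an extended period | CPU saturation amplifies query latency and slows connection handling | CPUUtilization >= 90% | 10m |
| P1 | FreeableMemory low | Freeable memory is too low | Memory shortage reduces cache efficiency and can cause swapping or process instability | < 1GiB or < 10% instance memory | 10m |
| P1 | Connections near limit | Database connection count is approaching the limit | Connection exhaustion makes new requests fail and blocks application connection pools | DatabaseConnections >= 80% max_connections | 10m |
| P1 | ReadLatency high | Read I/O latency is significantly elevated | Degrades query performance; commonly caused by storage pressure or slow queries | ReadLatency >= 0.02s and > 2 * 1h average | 10m |
| P1 | WriteLatency high | Write I/O latency is significantly elevated | Affects transaction commits and write throughput; may cause application timeouts | WriteLatency >= 0.02s and > 2 * 1h average | 10m |
| P1 | Replica lag serious | Aurora replica replication lag is severe | Read replicas may return stale data; failover risk increases | AuroraReplicaLagMaximum >= 120s | 5m |
| P1 | FreeStorageSpace low | Available storage space is too low | Storage exhaustion affects writes and database stability | < 10GiB | 10m |
| P2 | Disk queue high | Disk I/O queue keeps backing up | Storage cannot keep up with demand; this usually pushes read and write latency up | DiskQueueDepth >= 10 | 10m |
| P2 | Deadlocks | The database hit deadlocks | Points to problems in transaction design or concurrent writes; causes request failures | Deadlocks > 0 | 5m |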

If the instance memory size is known, prefer the ratio:

FreeableMemory / instance_memory_bytes < 0.10

If the instance memory size is unknown, use the production default:

FreeableMemory < 1073741824 bytes
for 10m
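
The two FreeableMemory thresholds above can be folded into one check. The following is a minimal Python sketch, not part of the original runbook: the function name `freeable_memory_is_low` and its parameters are hypothetical, and the 16 GiB instance size is only an illustrative assumption. It covers the threshold comparison only; the 10m duration is left to the alerting system.

```python
from typing import Optional

GIB = 1024 ** 3  # 1 GiB in bytes; FreeableMemory is reported in bytes


def freeable_memory_is_low(
    freeable_bytes: float,
    instance_memory_bytes: Optional[float] = None,
    min_ratio: float = 0.10,
    min_bytes: int = 1 * GIB,
) -> bool:
    """Return True when FreeableMemory breaches the threshold.

    Uses the ratio rule when the instance memory size is known,
    otherwise falls back to the absolute 1 GiB default.
    """
    if instance_memory_bytes:
        return freeable_bytes / instance_memory_bytes < min_ratio
    return freeable_bytes < min_bytes


# Hypothetical 16 GiB instance with 1.2 GiB freeable: 7.5% is below 10%.
print(freeable_memory_is_low(1.2 * GIB, instance_memory_bytes=16 * GIB))
# Same reading with unknown instance size: 1.2 GiB is not below 1 GiB.
print(freeable_memory_is_low(1.2 * GIB))
```

The same reading can breach one rule and pass the other, which is why the ratio is preferred on larger instances.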

The DatabaseConnections limit must be read from the database itself; do not guess it:

show max_connections;
select count(*) from pg_stat_activity;

2. CloudWatch Metrics

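| Metric | Meaning | Why Monitor | Namespace | Statistic | Period |
| --- | --- | --- | --- | --- | --- |
| CPUUtilization | DB instance CPU utilization | Tells whether query load or instance size is approaching a bottleneck | AWS/RDS | Average | 60s |
| FreeableMemory | Freeable memory on the DB instance | Reveals cache pressure, memory leaks, or an undersized instance | AWS/RDS | Average | 60s |
| DatabaseConnections | Current number of database connections | Reveals connection pool issues, connection leaks, and max_connections risk | AWS/RDS | Average | 60s |
| ReadLatency | Average read I/O latency | Detects storage read pressure and query performance regressions | AWS/RDS | Average | 60s |
| WriteLatency | Average write I/O latency | Detects transaction commit, write, and storage performance problems | AWS/RDS | Average | 60s |
| DiskQueueDepth | Depth of the I/O queue waiting on disk | Tells whether the storage layer has become the bottleneck | AWS/RDS | Average | 60s |
| FreeStorageSpace | Available storage space | Prevents write failures or instance faults caused by running out of space | AWS/RDS | Average | 60s |
| AuroraReplicaLagMaximum | Maximum Aurora replica replication lag | Indicates read replica freshness and failover risk | AWS/RDS | Maximum | 60s |
| Deadlocks | Number of deadlocks in the period | Detects transaction concurrency conflicts and SQL/index design problems | AWS/RDS | Sum | 300s |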

FreeableMemory query:

[
  {
    "Id": "free_mem",
    "MetricStat": {
      "Metric": {
        "Namespace": "AWS/RDS",
        "MetricName": "FreeableMemory",
        "Dimensions": [
          { "Name": "DBInstanceIdentifier", "Value": "prod-aurora-writer" }
        ]
      },
      "Period": 60,
      "Stat": "Average"
    },
    "ReturnData": true
  }
]

Save the query above as rds-freeable-memory.json, then run:

aws cloudwatch get-metric-data \
  --start-time 2026-06-02T00:00:00Z \
  --end-time 2026-06-02T01:00:00Z \
  --metric-data-queries file://rds-freeable-memory.json

To detect significantly elevated ReadLatency, use CloudWatch anomaly detection:

metric: AWS/RDS ReadLatency Average 60s
comparison: GreaterThanUpperThreshold
threshold_metric_id: ad1
metric math: ad1 = ANOMALY_DETECTION_BAND(m1, 2)
evaluation_periods: 10
datapoints_to_alarm: 10
extra static alarm: ReadLatency >= 0.02 seconds for 10 datapoints
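
The anomaly detection settings above map onto a CloudWatch PutMetricAlarm request. The sketch below only builds the parameter dictionary and prints it as JSON; it makes no AWS call. The function name, the alarm name, and the label text are hypothetical, and the field layout reflects the PutMetricAlarm request shape as understood here, so it is worth checking against the API reference before use.

```python
import json


def build_read_latency_anomaly_alarm(db_instance_id: str, band_width: float = 2) -> dict:
    """Build keyword arguments shaped for CloudWatch PutMetricAlarm."""
    return {
        "AlarmName": f"{db_instance_id}-read-latency-anomaly",  # hypothetical name
        "ComparisonOperator": "GreaterThanUpperThreshold",
        "EvaluationPeriods": 10,
        "DatapointsToAlarm": 10,
        # The alarm compares m1 against the upper edge of the band in ad1.
        "ThresholdMetricId": "ad1",
        "Metrics": [
            {
                "Id": "m1",
                "MetricStat": {
                    "Metric": {
                        "Namespace": "AWS/RDS",
                        "MetricName": "ReadLatency",
                        "Dimensions": [
                            {"Name": "DBInstanceIdentifier", "Value": db_instance_id}
                        ],
                    },
                    "Period": 60,
                    "Stat": "Average",
                },
                "ReturnData": True,
            },
            {
                "Id": "ad1",
                "Expression": f"ANOMALY_DETECTION_BAND(m1, {band_width})",
                "Label": "ReadLatency (expected)",  # hypothetical label
                "ReturnData": True,
            },
        ],
    }


params = build_read_latency_anomaly_alarm("prod-aurora-writer")
print(json.dumps(params, indent=2))
```

If boto3 is available, the dictionary could be passed as `boto3.client("cloudwatch").put_metric_alarm(**params)`. The extra static alarm (ReadLatency >= 0.02 seconds) would be a second, separate alarm with a plain Threshold.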

CloudWatch metric math for connections near the limit:

max_connections: 500
expression: 100 * connections / 500
threshold: 80
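
The metric math above can be expressed as a `--metric-data-queries` payload in the same format as the FreeableMemory query. This is a sketch under assumptions: the function name and the `conn_pct` id are hypothetical, and 500 stands in for whatever `show max_connections;` returns.

```python
import json


def build_connection_utilization_queries(db_instance_id: str, max_connections: int) -> list:
    """Build MetricDataQueries that return connection utilization in percent."""
    if max_connections <= 0:
        raise ValueError("max_connections must be positive")
    return [
        {
            # Raw connection count; used only as input to the expression.
            "Id": "connections",
            "MetricStat": {
                "Metric": {
                    "Namespace": "AWS/RDS",
                    "MetricName": "DatabaseConnections",
                    "Dimensions": [
                        {"Name": "DBInstanceIdentifier", "Value": db_instance_id}
                    ],
                },
                "Period": 60,
                "Stat": "Average",
            },
            "ReturnData": False,
        },
        {
            # Percentage of max_connections in use; compare against 80.
            "Id": "conn_pct",
            "Expression": f"100 * connections / {max_connections}",
            "Label": "Connection utilization (%)",
            "ReturnData": True,
        },
    ]


queries = build_connection_utilization_queries("prod-aurora-writer", 500)
print(json.dumps(queries, indent=2))
```

Because max_connections is hard-coded into the expression, the payload has to be regenerated whenever the instance class or parameter group changes the limit.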

3. PromQL

First confirm the YACE metric names:

# Use a regex to confirm which RDS/Aurora metric names YACE exposes.
# This is not an alert; it only confirms the metric names and label names.
{__name__=~"aws_rds_.*(cpu|freeable|connection|latency|lag|deadlock).*"}

# Aurora writer average CPU utilization is at or above 90%.
# Alert only after it holds for 10 minutes, to avoid false positives from short spikes.
aws_rds_cpu_utilization_average{dbinstance_identifier="prod-aurora-writer"} >= 90

# FreeableMemory is below 1GiB.
# If the instance's total memory is known, prefer FreeableMemory / instance_memory < 10%.
aws_rds_freeable_memory_average{dbinstance_identifier="prod-aurora-writer"} < 1073741824

# Connection utilization:
#   current connections / max_connections * 100
#   Replace 500 with the value returned by SQL `show max_connections;`.
# >= 80 means the connection pool or application connection count is near the limit.
100 *
aws_rds_database_connections_average{dbinstance_identifier="prod-aurora-writer"}
/
500
>= 80

ReadLatency / WriteLatency significantly elevated:

# ReadLatency significantly elevated:
#   The first condition, >= 0.02, means average read latency is at least 20ms.
#   The second condition, > 2 * 1h average, means it is more than double the last hour's mean.
# Both conditions must hold, which avoids false positives from small swings on a low baseline.
aws_rds_read_latency_average{dbinstance_identifier="prod-aurora-writer"} >= 0.02
and
aws_rds_read_latency_average{dbinstance_identifier="prod-aurora-writer"}
  > 2 * avg_over_time(aws_rds_read_latency_average{dbinstance_identifier="prod-aurora-writer"}[1h])

# WriteLatency significantly elevated:
#   >= 0.02 means average write latency is at least 20ms.
#   > 2 * 1h average means it is more than double the last hour's mean.
# Elevated write latency is usually related to storage, lock waits, checkpoints, or write pressure.
aws_rds_write_latency_average{dbinstance_identifier="prod-aurora-writer"} >= 0.02
and
aws_rds_write_latency_average{dbinstance_identifier="prod-aurora-writer"}
  > 2 * avg_over_time(aws_rds_write_latency_average{dbinstance_identifier="prod-aurora-writer"}[1h])
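
# Aurora replica lag maximum exceeds 120 seconds.
# CloudWatch reports AuroraReplicaLagMaximum in milliseconds, so 120s = 120000.
# With read/write splitting, readers would then return noticeably stale data.
aws_rds_aurora_replica_lag_maximum_maximum{dbcluster_identifier="prod-aurora"} >= 120000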
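
The two-condition latency rule above (absolute floor plus relative baseline) can be checked offline against sample data. This is a minimal Python sketch, assuming one sample per minute and that the 1h window includes the current sample, as `avg_over_time(...[1h])` does; the function name and the sample values are hypothetical.

```python
from statistics import mean
from typing import Sequence


def latency_is_elevated(
    samples: Sequence[float],
    abs_threshold: float = 0.02,
    baseline_factor: float = 2.0,
) -> bool:
    """Evaluate the rule on the last hour of latency samples, in seconds.

    samples are ordered oldest first; the last item is the current value.
    """
    if not samples:
        return False
    current = samples[-1]
    baseline = mean(samples)  # stands in for avg_over_time(...[1h])
    return current >= abs_threshold and current > baseline_factor * baseline


# Spike from a 5ms baseline to 50ms: both conditions hold.
print(latency_is_elevated([0.005] * 59 + [0.05]))
# Low baseline wobble from 1ms to 4ms: below the 20ms floor.
print(latency_is_elevated([0.001] * 59 + [0.004]))
# Steady 30ms for the whole hour: above the floor but not above 2x its own mean.
print(latency_is_elevated([0.03] * 60))
```

The last case shows a property of this rule worth knowing: once latency has been high for long enough, the 1h average catches up and the relative condition stops matching.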

4. vmalert Rules

groups:
  - name: aurora-postgresql.rules
    rules:
      - alert: AuroraFreeableMemoryLow
        # RDS FreeableMemory is reported in bytes; 1073741824 = 1GiB.
        expr: aws_rds_freeable_memory_average < 1073741824
        for: 10m
        labels:
          severity: P1
          component: aurora
        annotations:
          summary: "Aurora FreeableMemory is below 1GiB"

      - alert: AuroraReadLatencyHigh
        # "Significantly elevated" means both the absolute threshold and the relative baseline are exceeded.
        expr: |
          aws_rds_read_latency_average >= 0.02
          and
          aws_rds_read_latency_average > 2 * avg_over_time(aws_rds_read_latency_average[1h])
        for: 10m
        labels:
          severity: P1
          component: aurora
        annotations:
          summary: "Aurora ReadLatency is high: >=20ms and >2x 1h average"

      - alert: AuroraReplicaLagSerious
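        # Replication lag between readers and the writer exceeds 120 seconds.
        # CloudWatch reports this metric in milliseconds, so 120s = 120000.
        expr: aws_rds_aurora_replica_lag_maximum_maximum >= 120000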
        for: 5m
        labels:
          severity: P1
          component: aurora
        annotations:
          summary: "Aurora replica lag is >=120s"
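
The `for:` field in the rules above delays firing until the expression has held continuously for that duration. The following is a simplified Python model of that behavior, not vmalert's actual implementation; it assumes a 1-minute evaluation interval, which the rules above do not state, and the function name is hypothetical.

```python
from typing import List, Sequence


def alert_states(condition_by_eval: Sequence[bool], for_evals: int) -> List[str]:
    """Return 'inactive', 'pending', or 'firing' for each evaluation.

    for_evals is the `for:` duration divided by the evaluation interval,
    e.g. 10 for `for: 10m` at one evaluation per minute.
    """
    states = []
    active_since = None  # index of the first evaluation in the current true streak
    for i, condition in enumerate(condition_by_eval):
        if not condition:
            active_since = None  # any false evaluation resets the timer
            states.append("inactive")
            continue
        if active_since is None:
            active_since = i
        held = i - active_since  # intervals elapsed since the streak began
        states.append("firing" if held >= for_evals else "pending")
    return states


# Five true evaluations, one dip, then eleven true evaluations.
states = alert_states([True] * 5 + [False] + [True] * 11, for_evals=10)
print(states)
```

In this model a single false evaluation resets the timer, so a condition that flaps every few minutes would stay pending and never fire.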