AWS DynamoDB Monitoring


https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/metrics-dimensions.html
https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/using-metric-math.html
https://prometheus.io/docs/prometheus/latest/configuration/alerting_rules/

1. Alert Standard

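| Severity | Alert | Meaning | Why Monitor | Definition | Duration |
|---|---|---|---|---|---|
| P1 | Throttled requests | DynamoDB requests are being throttled | Throttling causes application errors, retry amplification, and higher latency | ThrottledRequests > 0 | 5m |
| P1 | System errors | DynamoDB server-side errors | Indicates an AWS-side or transient availability problem; the blast radius needs to be assessed | SystemErrors > 0 | 5m |
| P1 | GetItem/Query p95 high | Read requests succeed but tail latency is high | Surfaces hot partitions, index design problems, or access pattern problems | SuccessfulRequestLatency p95 >= 50ms | 10m |
| P1 | Put/Update p95 high | Write requests succeed but tail latency is high | Surfaces write hot spots, conditional-write contention, or capacity pressure | SuccessfulRequestLatency p95 >= 100ms | 10m |
| P2 | Read capacity high | Read capacity utilization is approaching the provisioned limit | Gives early warning of read throttling risk and the need to scale up | ConsumedReadCapacityUnits / (ProvisionedReadCapacityUnits × period in seconds) >= 80% | 10m |
| P2 | Write capacity high | Write capacity utilization is approaching the provisioned limit | Gives early warning of write throttling risk and the need to scale up | ConsumedWriteCapacityUnits / (ProvisionedWriteCapacityUnits × period in seconds) >= 80% | 10m |
| P2 | User errors high | The number of client-side request errors is high | Surfaces bad parameters, invalid expressions, permission problems, or faulty calling code | UserErrors >= 10 in 5m | 5m |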

On-Demand tables should also be monitored for throttling and latency; the capacity ratio only applies to provisioned tables.
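The split between the two billing modes can be sketched in a few lines of Python. This is only an illustration of the rule above: the alert names are copied from the table, `PROVISIONED` and `PAY_PER_REQUEST` are the `BillingMode` values returned by the DynamoDB API, and the helper name `alerts_for` is hypothetical. Treating the system and user error alerts as mode-independent is an assumption, since the note above only names throttling and latency explicitly.

```python
# Alerts that make sense regardless of billing mode (assumption: error
# alerts are mode-independent, like throttling and latency).
COMMON_ALERTS = [
    "Throttled requests",
    "System errors",
    "GetItem/Query p95 high",
    "Put/Update p95 high",
    "User errors high",
]

# Capacity ratios need a provisioned denominator, so they only apply
# to tables in PROVISIONED mode.
PROVISIONED_ONLY_ALERTS = [
    "Read capacity high",
    "Write capacity high",
]


def alerts_for(billing_mode: str) -> list[str]:
    """Return the alert names that apply to a table with this billing mode."""
    if billing_mode not in ("PROVISIONED", "PAY_PER_REQUEST"):
        raise ValueError(f"unknown billing mode: {billing_mode!r}")
    alerts = list(COMMON_ALERTS)
    if billing_mode == "PROVISIONED":
        alerts.extend(PROVISIONED_ONLY_ALERTS)
    return alerts
```

For an on-demand table, `alerts_for("PAY_PER_REQUEST")` returns only the five common alerts, so no capacity-ratio rule is created for it.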

2. CloudWatch Metrics

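| Metric | Meaning | Why Monitor | Namespace | Statistic | Period |
|---|---|---|---|---|---|
| ThrottledRequests | Number of requests throttled by DynamoDB | Detects insufficient capacity, hot partitions, or on-demand burst limits | AWS/DynamoDB | Sum | 300s |
| SystemErrors | Number of DynamoDB server-side errors | Identifies AWS-side failures and their availability impact | AWS/DynamoDB | Sum | 300s |
| UserErrors | Number of client-side request errors | Identifies problems in code, parameters, permissions, or condition expressions | AWS/DynamoDB | Sum | 300s |
| SuccessfulRequestLatency | Response latency of successful requests | Measures the performance of DynamoDB operations and the health of access patterns | AWS/DynamoDB | p95 | 60s |
| ConsumedReadCapacityUnits | Read capacity consumed during the period | Used to compute read capacity utilization and cost/scaling trends | AWS/DynamoDB | Sum | 300s |
| ProvisionedReadCapacityUnits | Configured read capacity | Denominator for read capacity utilization; shows how close usage is to the limit | AWS/DynamoDB | Average | 300s |
| ConsumedWriteCapacityUnits | Write capacity consumed during the period | Used to compute write capacity utilization and write pressure trends | AWS/DynamoDB | Sum | 300s |
| ProvisionedWriteCapacityUnits | Configured write capacity | Denominator for write capacity utilization; shows how close usage is to the limit | AWS/DynamoDB | Average | 300s |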

Read capacity utilization:

[
  {
    "Id": "consumed",
    "MetricStat": {
      "Metric": {
        "Namespace": "AWS/DynamoDB",
        "MetricName": "ConsumedReadCapacityUnits",
        "Dimensions": [{ "Name": "TableName", "Value": "prod-orders" }]
      },
      "Period": 300,
      "Stat": "Sum"
    },
    "ReturnData": false
  },
  {
    "Id": "provisioned",
    "MetricStat": {
      "Metric": {
        "Namespace": "AWS/DynamoDB",
        "MetricName": "ProvisionedReadCapacityUnits",
        "Dimensions": [{ "Name": "TableName", "Value": "prod-orders" }]
      },
      "Period": 300,
      "Stat": "Average"
    },
    "ReturnData": false
  },
  {
    "Id": "read_capacity_percent",
    "Expression": "IF(provisioned>0,100*consumed/(provisioned*300),0)",
    "Label": "Read capacity percent",
    "ReturnData": true
  }
]

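ConsumedReadCapacityUnits is the total consumed over the period, while ProvisionedReadCapacityUnits is a per-second capacity, so with a 300-second period the provisioned value must be multiplied by 300.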
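The same arithmetic can be checked locally, and the query structure reused for write capacity, with a small Python sketch. The function names `capacity_percent` and `capacity_queries` are hypothetical helpers, not part of any AWS SDK; the write variant assumes the metric names `ConsumedWriteCapacityUnits` and `ProvisionedWriteCapacityUnits` from the table above and the same expression shape as the read query.

```python
def capacity_percent(consumed_sum: float, provisioned_per_second: float,
                     period_s: int = 300) -> float:
    """Utilization percent: period total divided by (per-second capacity * period)."""
    if provisioned_per_second <= 0:
        # Mirrors the IF(provisioned>0, ..., 0) guard in the metric math expression.
        return 0.0
    return 100.0 * consumed_sum / (provisioned_per_second * period_s)


def capacity_queries(table: str, kind: str, period_s: int = 300) -> list[dict]:
    """Build the three metric math queries for 'Read' or 'Write' capacity."""
    if kind not in ("Read", "Write"):
        raise ValueError("kind must be 'Read' or 'Write'")

    def stat(query_id: str, metric_name: str, statistic: str) -> dict:
        return {
            "Id": query_id,
            "MetricStat": {
                "Metric": {
                    "Namespace": "AWS/DynamoDB",
                    "MetricName": metric_name,
                    "Dimensions": [{"Name": "TableName", "Value": table}],
                },
                "Period": period_s,
                "Stat": statistic,
            },
            "ReturnData": False,
        }

    return [
        stat("consumed", f"Consumed{kind}CapacityUnits", "Sum"),
        stat("provisioned", f"Provisioned{kind}CapacityUnits", "Average"),
        {
            "Id": f"{kind.lower()}_capacity_percent",
            "Expression": f"IF(provisioned>0,100*consumed/(provisioned*{period_s}),0)",
            "Label": f"{kind} capacity percent",
            "ReturnData": True,
        },
    ]
```

As a worked case, a table provisioned at 100 RCU per second that consumed 24,000 RCU in one 300-second period is at 100 × 24000 / (100 × 300) = 80%, exactly the P2 threshold. The list returned by `capacity_queries("prod-orders", "Write")` has the same shape as the JSON above and could be passed as `MetricDataQueries` to the CloudWatch `GetMetricData` API.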

3. PromQL

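# Use a regex to confirm the DynamoDB metric names that YACE actually exposes.
# This is not an alert expression; it is only for finding the real metric names and labels.
{__name__=~"aws_dynamodb_.*(throttled|latency|error|capacity).*"}

# Throttling occurred in the last 5 minutes.
# Throttling means DynamoDB rejected or delayed requests; check capacity, hot partitions, or on-demand bursts.
# YACE exports each CloudWatch statistic as a gauge (the Sum of one period), not a counter,
# so use max_over_time() instead of increase().
max_over_time(aws_dynamodb_throttled_requests_sum{table_name="prod-orders"}[5m]) > 0

# AWS server-side system errors occurred in the last 5 minutes.
# SystemErrors are not application parameter errors; treat them as an availability problem.
max_over_time(aws_dynamodb_system_errors_sum{table_name="prod-orders"}[5m]) > 0

# GetItem / Query p95 latency exceeds 50ms.
# The operation label uses a regex to match read operations; for write operations a 100ms threshold is suggested.
aws_dynamodb_successful_request_latency_p95{table_name="prod-orders",operation=~"GetItem|Query"} >= 50

# Provisioned read capacity utilization:
#   consumed capacity Sum / 300 = RCU consumed per second (assumes a 300s CloudWatch period)
#   provisioned Average = configured RCU per second
#   their ratio * 100 = utilization percent
# >= 80 means read capacity is close to becoming a bottleneck.
100 *
(aws_dynamodb_consumed_read_capacity_units_sum{table_name="prod-orders"} / 300)
/
clamp_min(aws_dynamodb_provisioned_read_capacity_units_average{table_name="prod-orders"}, 1)
>= 80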
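The discovery query at the top of this section can be run against the Prometheus HTTP API and its response reduced to "which metric names exist, and which labels do they carry". The sketch below uses only the Python standard library. The `/api/v1/query` path and the response layout follow the Prometheus HTTP API; the helper names, the base URL, and the sample response used to exercise the parser are illustrative assumptions.

```python
from urllib.parse import urlencode

# The discovery expression from this section.
DISCOVERY_EXPR = '{__name__=~"aws_dynamodb_.*(throttled|latency|error|capacity).*"}'


def instant_query_url(base_url: str, expr: str) -> str:
    """Build an instant-query URL for a Prometheus-compatible API."""
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode({'query': expr})}"


def metric_labels(response: dict) -> dict[str, set[str]]:
    """Map each metric name in a query response to the label names seen on it."""
    found: dict[str, set[str]] = {}
    for series in response["data"]["result"]:
        labels = dict(series["metric"])
        name = labels.pop("__name__")
        # Updating a set from a dict adds the dict's keys (the label names).
        found.setdefault(name, set()).update(labels)
    return found
```

Fetching `instant_query_url("http://localhost:9090", DISCOVERY_EXPR)` (the host is a placeholder) and passing the decoded JSON body to `metric_labels` shows whether the table label is really called `table_name` in a given setup before any alert expression depends on it.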

4. vmalert Rules

groups:
  - name: dynamodb.rules
    rules:
      - alert: DynamoDBThrottledRequests
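        # Throttling occurred in the last 5 minutes.
        # The metric is a gauge holding the CloudWatch period Sum, so increase() does not apply.
        expr: max_over_time(aws_dynamodb_throttled_requests_sum[5m]) > 0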
        for: 5m
        labels:
          severity: P1
          component: dynamodb
        annotations:
          summary: "DynamoDB throttled requests occurred"

      - alert: DynamoDBReadCapacityHigh
        # RCU consumed per second / provisioned RCU * 100 (assumes a 300s CloudWatch period).
        expr: |
          100 * (aws_dynamodb_consumed_read_capacity_units_sum / 300)
          /
          clamp_min(aws_dynamodb_provisioned_read_capacity_units_average, 1) >= 80
        for: 10m
        labels:
          severity: P2
          component: dynamodb
        annotations:
          summary: "DynamoDB read capacity usage is >= 80%"
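The `for:` field in these rules corresponds to the Duration column in the alert table: the expression has to stay true for that long before the alert fires. A simplified Python model of that behaviour is sketched below. It assumes evaluations at a fixed interval and resets the pending state on any false evaluation; the function name `first_firing_index` is hypothetical and the model ignores details such as missing samples or rule restarts.

```python
from typing import Optional, Sequence


def first_firing_index(evaluations: Sequence[bool], interval_s: int,
                       for_s: int) -> Optional[int]:
    """Return the index of the evaluation at which the alert first fires.

    evaluations: result of the rule expression at each evaluation tick.
    interval_s:  seconds between evaluations.
    for_s:       the rule's `for` duration in seconds.
    """
    active_since: Optional[int] = None
    for i, is_true in enumerate(evaluations):
        if not is_true:
            # Condition cleared: the pending state is dropped.
            active_since = None
            continue
        if active_since is None:
            # First true evaluation: the alert becomes pending.
            active_since = i
        if (i - active_since) * interval_s >= for_s:
            return i
    return None
```

With a 60-second evaluation interval and `for: 5m`, six consecutive true evaluations are needed before the alert fires, and a single false evaluation in between restarts the wait. This is why a brief throttling spike may never page under the rules above.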