Links
https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/metrics-dimensions.html
https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/using-metric-math.html
https://prometheus.io/docs/prometheus/latest/configuration/alerting_rules/
1. Alert Standard
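| Severity | Alert | Meaning | Why Monitor | Definition | Duration |
|---|---|---|---|---|---|
| P1 | Throttled requests | DynamoDB requests are being throttled | Throttling causes application errors, retry amplification, and higher latency | ThrottledRequests > 0 | 5m |
| P1 | System errors | DynamoDB server-side errors | Indicates an AWS service-side or transient availability problem; assess the scope of impact | SystemErrors > 0 | 5m |
| P1 | GetItem/Query p95 high | Read requests succeed but tail latency is high | Surfaces hot partitions, index design problems, or access-pattern problems | SuccessfulRequestLatency p95 >= 50ms | 10m |
| P1 | Put/Update p95 high | Write requests succeed but tail latency is high | Surfaces write hot spots, conditional-write conflicts, or capacity pressure | SuccessfulRequestLatency p95 >= 100ms | 10m |
| P2 | Read capacity high | Read capacity utilization is approaching the provisioned limit | Early warning of read throttling risk and the need to scale up | ConsumedReadCapacityUnits / ProvisionedReadCapacityUnits >= 80% | 10m |
| P2 | Write capacity high | Write capacity utilization is approaching the provisioned limit | Early warning of write throttling risk and the need to scale up | ConsumedWriteCapacityUnits / ProvisionedWriteCapacityUnits >= 80% | 10m |
| P2 | User errors high | High number of client-side request errors | Surfaces parameter errors, invalid condition expressions, permission problems, or bad calling code | UserErrors >= 10 in 5m | 5m |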
On-Demand tables should still be monitored for throttling and latency; the capacity ratio applies only to provisioned tables.
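The split between provisioned and on-demand tables can be sketched as a small lookup. This is only an illustration under stated assumptions: `ALERTS` and `applicable_alerts` are hypothetical names, the alert names and severities are copied from the table above, and the billing-mode strings are the values DynamoDB reports as `BillingMode` (`PROVISIONED` and `PAY_PER_REQUEST`).

```python
# Hypothetical catalogue of the alerts listed in the Alert Standard table.
# needs_provisioned marks alerts whose definition divides by provisioned capacity.
ALERTS = [
    {"name": "Throttled requests", "severity": "P1", "needs_provisioned": False},
    {"name": "System errors", "severity": "P1", "needs_provisioned": False},
    {"name": "GetItem/Query p95 high", "severity": "P1", "needs_provisioned": False},
    {"name": "Put/Update p95 high", "severity": "P1", "needs_provisioned": False},
    {"name": "Read capacity high", "severity": "P2", "needs_provisioned": True},
    {"name": "Write capacity high", "severity": "P2", "needs_provisioned": True},
    {"name": "User errors high", "severity": "P2", "needs_provisioned": False},
]


def applicable_alerts(billing_mode: str) -> list[str]:
    """Return the alert names that make sense for a table's billing mode."""
    provisioned = billing_mode == "PROVISIONED"
    return [
        alert["name"]
        for alert in ALERTS
        if provisioned or not alert["needs_provisioned"]
    ]


if __name__ == "__main__":
    # On-demand tables keep throttle, error, and latency alerts only.
    for name in applicable_alerts("PAY_PER_REQUEST"):
        print(name)
```

Keeping the capacity-ratio alerts behind a flag avoids evaluating a ratio whose denominator has no meaning for an on-demand table.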
2. CloudWatch Metrics
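| Metric | Meaning | Why Monitor | Namespace | Statistic | Period |
|---|---|---|---|---|---|
| ThrottledRequests | Number of requests throttled by DynamoDB | Detect insufficient capacity, hot partitions, or on-demand burst limits | AWS/DynamoDB | Sum | 300s |
| SystemErrors | Number of DynamoDB server-side errors | Identify AWS service-side anomalies and availability impact | AWS/DynamoDB | Sum | 300s |
| UserErrors | Number of client-side request errors | Identify code, parameter, permission, or condition-expression problems | AWS/DynamoDB | Sum | 300s |
| SuccessfulRequestLatency | Response latency of successful requests | Measure DynamoDB operation performance and access-pattern health | AWS/DynamoDB | p95 | 60s |
| ConsumedReadCapacityUnits | Read capacity consumed during the period | Compute read capacity utilization and cost/scaling trends | AWS/DynamoDB | Sum | 300s |
| ProvisionedReadCapacityUnits | Configured read capacity | Denominator of read capacity utilization; shows whether the limit is near | AWS/DynamoDB | Average | 300s |
| ConsumedWriteCapacityUnits | Write capacity consumed during the period | Compute write capacity utilization and write-pressure trends | AWS/DynamoDB | Sum | 300s |
| ProvisionedWriteCapacityUnits | Configured write capacity | Denominator of write capacity utilization; shows whether the limit is near | AWS/DynamoDB | Average | 300s |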
Read capacity utilization:
[
  {
    "Id": "consumed",
    "MetricStat": {
      "Metric": {
        "Namespace": "AWS/DynamoDB",
        "MetricName": "ConsumedReadCapacityUnits",
        "Dimensions": [{ "Name": "TableName", "Value": "prod-orders" }]
      },
      "Period": 300,
      "Stat": "Sum"
    },
    "ReturnData": false
  },
  {
    "Id": "provisioned",
    "MetricStat": {
      "Metric": {
        "Namespace": "AWS/DynamoDB",
        "MetricName": "ProvisionedReadCapacityUnits",
        "Dimensions": [{ "Name": "TableName", "Value": "prod-orders" }]
      },
      "Period": 300,
      "Stat": "Average"
    },
    "ReturnData": false
  },
  {
    "Id": "read_capacity_percent",
    "Expression": "IF(provisioned>0,100*consumed/(provisioned*300),0)",
    "Label": "Read capacity percent",
    "ReturnData": true
  }
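]
ConsumedReadCapacityUnits is the total consumed over the period, while ProvisionedReadCapacityUnits is a per-second capacity, so for a 300-second period the provisioned value must be multiplied by 300.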
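The arithmetic behind the metric math expression can be checked with a small worked example. This is a minimal sketch: `capacity_percent` is a hypothetical helper, not a CloudWatch API, and the sample numbers (100 provisioned RCU, 24,000 RCU consumed in one period) are made up for illustration.

```python
def capacity_percent(
    consumed_sum: float,
    provisioned_per_second: float,
    period_seconds: int = 300,
) -> float:
    """Mirror IF(provisioned>0, 100*consumed/(provisioned*period), 0).

    consumed_sum is the Sum statistic over the period (total units),
    provisioned_per_second is the Average statistic (units per second).
    """
    if provisioned_per_second <= 0:
        return 0.0
    # Total capacity available in the period = per-second capacity * seconds.
    available = provisioned_per_second * period_seconds
    return 100 * consumed_sum / available


if __name__ == "__main__":
    # Assumed sample: 100 RCU provisioned gives 30,000 RCU per 300 s,
    # so 24,000 consumed RCU is 80% of the available capacity.
    print(capacity_percent(consumed_sum=24_000, provisioned_per_second=100))
```

Dividing the period sum by the raw provisioned value without the period factor would report 24,000% here, which is why the expression multiplies by 300.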
3. PromQL
# Use a regex first to confirm the DynamoDB metric names that YACE exposes.
# This is not an alert expression; it is only for discovering the real metric names and labels.
{__name__=~"aws_dynamodb_.*(throttled|latency|error|capacity).*"}

# Throttling occurred in the last 5 minutes.
# A throttle means DynamoDB rejected or delayed requests; usually check capacity, hot partitions, or on-demand bursts.
increase(aws_dynamodb_throttled_requests_sum{table_name="prod-orders"}[5m]) > 0

# AWS server-side system errors occurred in the last 5 minutes.
# SystemErrors are not application parameter errors and should be handled as an availability problem.
increase(aws_dynamodb_system_errors_sum{table_name="prod-orders"}[5m]) > 0

# p95 latency of GetItem / Query exceeds 50ms.
# The operation label uses a regex to match read operations; for write operations a 100ms threshold is recommended.
aws_dynamodb_successful_request_latency_p95{table_name="prod-orders",operation=~"GetItem|Query"} >= 50

# Provisioned read capacity utilization:
# rate(consumed capacity sum[5m]) = RCU consumed per second
# provisioned average = configured RCU per second
# dividing the two and multiplying by 100 = utilization percentage
# >= 80 means read capacity is close to the bottleneck.
100 *
rate(aws_dynamodb_consumed_read_capacity_units_sum{table_name="prod-orders"}[5m])
/
clamp_min(aws_dynamodb_provisioned_read_capacity_units_average{table_name="prod-orders"}, 1)
>= 80

4. vmalert Rules
groups:
  - name: dynamodb.rules
    rules:
      - alert: DynamoDBThrottledRequests
        # The throttle increase over the last 5 minutes is greater than 0.
        expr: increase(aws_dynamodb_throttled_requests_sum[5m]) > 0
        for: 5m
        labels:
          severity: P1
          component: dynamodb
        annotations:
          summary: "DynamoDB throttled requests occurred"
      - alert: DynamoDBReadCapacityHigh
        # RCU consumed per second / provisioned RCU * 100.
        expr: |
          100 * rate(aws_dynamodb_consumed_read_capacity_units_sum[5m])
          /
          clamp_min(aws_dynamodb_provisioned_read_capacity_units_average, 1) >= 80
        for: 10m
        labels:
          severity: P2
          component: dynamodb
        annotations:
          summary: "DynamoDB read capacity usage is >= 80%"
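How the `for:` field delays firing can be sketched with a simplified model. The assumptions here are that vmalert follows the Prometheus-style pending-then-firing behaviour and that rules are evaluated every 60 seconds; `first_firing_time` is a hypothetical helper written for this illustration, not part of vmalert.

```python
from typing import Optional


def first_firing_time(
    evaluations: list[tuple[int, bool]],
    for_seconds: int,
) -> Optional[int]:
    """Return the timestamp at which the alert would start firing, or None.

    evaluations is a time-ordered list of (timestamp_seconds, expr_is_true).
    The alert becomes pending on the first true evaluation and fires once
    the expression has stayed true for at least for_seconds. Any false
    evaluation resets the pending state.
    """
    pending_since: Optional[int] = None
    for timestamp, is_true in evaluations:
        if not is_true:
            pending_since = None
            continue
        if pending_since is None:
            pending_since = timestamp
        if timestamp - pending_since >= for_seconds:
            return timestamp
    return None


if __name__ == "__main__":
    # Assumed 60 s evaluation interval, expression true from t=0 onward.
    steady = [(t, True) for t in range(0, 601, 60)]
    print(first_firing_time(steady, for_seconds=300))
```

Under this model a condition that flaps to false even once restarts the wait, so a short `for:` suits sparse signals such as throttling while a longer one suits sustained conditions such as capacity utilization.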