Links#
https://docs.aws.amazon.com/AWSSimpleQueueService/latest/SQSDeveloperGuide/sqs-available-cloudwatch-metrics.html
https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/using-metric-math.html
https://prometheus.io/docs/prometheus/latest/configuration/alerting_rules/

1. Alert Standard#
| Severity | Alert | Meaning | Why Monitor | Definition | Duration |
|---|---|---|---|---|---|
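| P1 | Oldest message too old | The oldest visible message has been waiting a long time | Directly reflects consumer lag and business SLA risk | ApproximateAgeOfOldestMessage >= 300s | 10m |
| P1 | DLQ has messages | Unprocessed failed messages have appeared in the DLQ | The normal consumption path has given up on these messages, so manual investigation is needed | ApproximateNumberOfMessagesVisible >= 1 on DLQ | 1m |
| P1 | Drain time too long | Clearing the backlog at the current consumption rate would take a long time | Judges whether recovery time is acceptable better than backlog size alone | visible / delete_rate_per_second >= 900s | 10m |
| P2 | Visible backlog high | Many visible messages are waiting in the queue to be consumed | Gives early warning of insufficient consumer capacity or a traffic spike | ApproximateNumberOfMessagesVisible >= 1000 | 15m |
| P2 | Consumer stalled | Messages are arriving but none are being deleted | Detects hung consumers, permission errors, or processing-failure loops | sent > 0 and deleted == 0 | 10m |
| P2 | NotVisible high | Many messages are held by consumers but not yet completed | Detects slow processing, an unsuitable visibility timeout, or blocked workers | ApproximateNumberOfMessagesNotVisible >= 500 | 15m |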
When the business SLA is stricter than the defaults, derive the thresholds from it:
oldest_message_age_threshold = min(300s, 0.5 * business_sla_seconds)
drain_time_threshold = min(900s, business_sla_seconds)

2. CloudWatch Metrics#
| Metric | Meaning | Why Monitor | Namespace | Statistic | Period |
|---|---|---|---|---|---|
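| ApproximateAgeOfOldestMessage | Time the oldest visible message in the queue has been waiting | Measures consumer lag and SLA risk | AWS/SQS | Maximum | 60s |
| ApproximateNumberOfMessagesVisible | Number of visible messages waiting to be consumed | Measures backlog and consumer capacity | AWS/SQS | Average | 60s |
| ApproximateNumberOfMessagesNotVisible | Number of messages received but not yet deleted | Identifies slow or stuck processing and visibility timeout problems | AWS/SQS | Average | 60s |
| NumberOfMessagesSent | Number of messages sent to the queue during the period | Tracks inbound traffic and spikes, and feeds the drain-time calculation | AWS/SQS | Sum | 300s |
| NumberOfMessagesDeleted | Number of messages successfully deleted during the period | Represents the actual completion rate and shows whether consumers are working | AWS/SQS | Sum | 300s |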
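
The drain-time estimate combines two of the metrics above: the visible backlog divided by a per-second delete rate derived from the NumberOfMessagesDeleted Sum over its period. The following is a minimal Python sketch of that arithmetic, together with the SLA override from the Alert Standard section. The function and parameter names are illustrative assumptions, not part of any AWS or Prometheus API, and the stalled-consumer sentinel simply mirrors the value used in the metric math expression.

```python
def sla_thresholds(business_sla_seconds=None):
    """Return (oldest_message_age_threshold, drain_time_threshold) in seconds.

    With no business SLA the defaults of 300 s and 900 s apply; a stricter
    SLA lowers them, following the min(...) rules in the Alert Standard section.
    """
    if business_sla_seconds is None:
        return 300.0, 900.0
    oldest = min(300.0, 0.5 * business_sla_seconds)
    drain = min(900.0, float(business_sla_seconds))
    return oldest, drain


def drain_seconds(visible, deleted_sum, period_seconds=300, stalled_value=999999.0):
    """Estimate how long the backlog takes to clear at the current delete rate.

    visible        -- ApproximateNumberOfMessagesVisible (Average)
    deleted_sum    -- NumberOfMessagesDeleted (Sum) over period_seconds
    stalled_value  -- sentinel returned when nothing was deleted (assumed value)
    """
    if deleted_sum <= 0:
        return stalled_value
    delete_rate_per_second = deleted_sum / period_seconds
    return visible / delete_rate_per_second


if __name__ == "__main__":
    _, drain_threshold = sla_thresholds(business_sla_seconds=600)
    estimate = drain_seconds(visible=1200, deleted_sum=300)
    print(estimate, estimate >= drain_threshold)
```

With 1200 visible messages and 300 deletes in a 300 s period, the delete rate is one message per second, so the estimate is 1200 s, which exceeds both the default 900 s threshold and a 600 s business SLA.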
Drain time metric math:
[
  {
    "Id": "visible",
    "MetricStat": {
      "Metric": {
        "Namespace": "AWS/SQS",
        "MetricName": "ApproximateNumberOfMessagesVisible",
        "Dimensions": [{ "Name": "QueueName", "Value": "prod-job-queue" }]
      },
      "Period": 300,
      "Stat": "Average"
    },
    "ReturnData": false
  },
  {
    "Id": "deleted",
    "MetricStat": {
      "Metric": {
        "Namespace": "AWS/SQS",
        "MetricName": "NumberOfMessagesDeleted",
        "Dimensions": [{ "Name": "QueueName", "Value": "prod-job-queue" }]
      },
      "Period": 300,
      "Stat": "Sum"
    },
    "ReturnData": false
  },
  {
    "Id": "drain_seconds",
    "Expression": "IF(deleted>0,visible/(deleted/300),999999)",
    "Label": "Estimated drain time seconds",
    "ReturnData": true
  }
]

CloudWatch Alarm condition:
metric math id: drain_seconds
comparison: GreaterThanOrEqualToThreshold
threshold: 900
evaluation_periods: 2
datapoints_to_alarm: 2
period: 300
severity: P1

3. PromQL#
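# First use a regex to confirm the SQS metric names that YACE exposes.
# This is not an alert; it only checks that labels such as queue_name match the current environment.
{__name__=~"aws_sqs_.*(oldest|visible|sent|deleted).*"}

# The oldest message in the queue is more than 300 seconds old.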
# At least one message has gone 5 minutes without being successfully consumed.
aws_sqs_approximate_age_of_oldest_message_maximum{queue_name="prod-job-queue"} >= 300

# DLQ visible message count >= 1.
# Messages in the DLQ usually mean that consumption from the main queue failed; they need manual or automated replay.
aws_sqs_approximate_number_of_messages_visible_average{queue_name="prod-job-dlq"} >= 1

# Estimated queue drain time:
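# visible messages / delete rate per second
# YACE exports CloudWatch statistics as gauges, not counters, so rate() does not apply here.
# The delete rate is the 5-minute average of the NumberOfMessagesDeleted Sum divided by
# the 300 s period from the metrics table (adjust if the YACE period differs).
# clamp_min(..., 0.01) avoids division by zero when the consumption rate is 0.
# >= 900 means that clearing the backlog at the current rate would take 15 minutes or more.
aws_sqs_approximate_number_of_messages_visible_average{queue_name="prod-job-queue"}
/
clamp_min(avg_over_time(aws_sqs_number_of_messages_deleted_sum{queue_name="prod-job-queue"}[5m]) / 300, 0.01)
>= 900

# Consumer stalled: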
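# Messages entered the queue during the last 10 minutes,
# but none were deleted during the same 10 minutes.
# This usually means workers have stopped, there is a permission error, the code is hung, or the visibility timeout is misconfigured.
# The YACE series are gauges, so max_over_time() is used instead of increase().
max_over_time(aws_sqs_number_of_messages_sent_sum{queue_name="prod-job-queue"}[10m]) > 0
and
max_over_time(aws_sqs_number_of_messages_deleted_sum{queue_name="prod-job-queue"}[10m]) == 0

4. vmalert Rules#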
groups:
  - name: sqs.rules
    rules:
      - alert: SQSOldestMessageTooOld
        # Oldest message age above 300 seconds means queue processing delay has exceeded the default SLA.
        expr: aws_sqs_approximate_age_of_oldest_message_maximum >= 300
        for: 10m
        labels:
          severity: P1
          component: sqs
        annotations:
          summary: "SQS oldest message age is >= 300s"
      - alert: SQSDlqHasMessages
        # Any visible message in a DLQ should alert, because it means a consumption failure has already happened.
        expr: aws_sqs_approximate_number_of_messages_visible_average{queue_name=~".*dlq.*"} >= 1
        for: 1m
        labels:
          severity: P1
          component: sqs
        annotations:
          summary: "SQS DLQ has visible messages"
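
The `for` field in these rules delays firing until the expression has held continuously for the stated duration; one evaluation where the expression is false resets the timer. The Python sketch below is a rough model of that behaviour, assuming evaluations arrive in time order. The function `first_firing_time` is hypothetical and is not vmalert or Prometheus code.

```python
def first_firing_time(samples, for_seconds):
    """Return the timestamp at which an alert would move from pending to firing.

    samples     -- iterable of (timestamp_seconds, condition_is_true), in time order
    for_seconds -- the rule's `for` duration in seconds
    Returns None if the condition never holds long enough.
    """
    pending_since = None
    for timestamp, active in samples:
        if not active:
            # A single healthy evaluation clears the pending state.
            pending_since = None
            continue
        if pending_since is None:
            pending_since = timestamp
        if timestamp - pending_since >= for_seconds:
            return timestamp
    return None


if __name__ == "__main__":
    # Oldest-message age sampled every 60 s; it dips below 300 s once at t=180.
    ages = [100, 320, 350, 200] + [310] * 12
    samples = [(60 * i, age >= 300) for i, age in enumerate(ages)]
    print(first_firing_time(samples, for_seconds=600))
```

In this example the dip at t=180 restarts the pending window at t=240, so with `for: 10m` the alert would not fire until t=840.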