AWS SQS Monitoring


https://docs.aws.amazon.com/AWSSimpleQueueService/latest/SQSDeveloperGuide/sqs-available-cloudwatch-metrics.html
https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/using-metric-math.html
https://prometheus.io/docs/prometheus/latest/configuration/alerting_rules/

1. Alert Standard

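Severity | Alert | Meaning | Why monitor | Definition | Duration
P1 | Oldest message too old | The oldest undeleted message has been waiting a long time | Directly reflects consumer lag and business SLA risk | ApproximateAgeOfOldestMessage >= 300s | 10m
P1 | DLQ has messages | Failed messages are sitting unprocessed in the DLQ | The normal consumption path has given up on them, so manual investigation is needed | ApproximateNumberOfMessagesVisible >= 1 on DLQ | 1m
P1 | Drain time too long | Clearing the backlog at the current consumption rate would take a long time | Shows whether recovery time is acceptable better than backlog size alone | visible / delete_rate_per_second >= 900s | 10m
P2 | Visible backlog high | Many visible messages are waiting to be consumed | Gives early warning of insufficient consumer capacity or a traffic spike | ApproximateNumberOfMessagesVisible >= 1000 | 15m
P2 | Consumer stalled | Messages are arriving but none are being deleted | Detects hung consumers, permission errors, or failure loops | sent > 0 and deleted == 0 | 10m
P2 | NotVisible high | Many messages are held by consumers but not yet completed | Detects slow processing, a badly sized visibility timeout, or blocked workers | ApproximateNumberOfMessagesNotVisible >= 500 | 15m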

When the business SLA is stricter than the defaults, the SLA-derived values override them:

oldest_message_age_threshold = min(300s, 0.5 * business_sla_seconds)
drain_time_threshold = min(900s, business_sla_seconds)
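
The two formulas above can be sketched as a small helper. The function name `sqs_alert_thresholds` and the choice to fall back to the defaults when no SLA is given are assumptions made for this illustration, not part of the standard itself.

```python
def sqs_alert_thresholds(business_sla_seconds=None):
    """Return (oldest_message_age_threshold, drain_time_threshold) in seconds.

    Hypothetical helper: applies the min() formulas from the alert standard.
    """
    default_age = 300.0    # default oldest-message threshold
    default_drain = 900.0  # default drain-time threshold
    if business_sla_seconds is None:
        # Assumption: no SLA supplied means the defaults apply unchanged.
        return default_age, default_drain
    if business_sla_seconds <= 0:
        raise ValueError("business_sla_seconds must be positive")
    age = min(default_age, 0.5 * business_sla_seconds)
    drain = min(default_drain, float(business_sla_seconds))
    return age, drain


if __name__ == "__main__":
    # A 2-minute SLA tightens both thresholds; a 1-hour SLA leaves the defaults.
    print(sqs_alert_thresholds(120))
    print(sqs_alert_thresholds(3600))
```

With a 120-second SLA both thresholds tighten (60s and 120s); with a 3600-second SLA the defaults of 300s and 900s remain in force.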

2. CloudWatch Metrics

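Metric | Meaning | Why monitor | Namespace | Statistic | Period
ApproximateAgeOfOldestMessage | Age of the oldest undeleted message in the queue | Measures consumer lag and SLA risk | AWS/SQS | Maximum | 60s
ApproximateNumberOfMessagesVisible | Number of visible messages waiting to be consumed | Measures backlog and consumer capacity | AWS/SQS | Average | 60s
ApproximateNumberOfMessagesNotVisible | Number of messages received but not yet deleted | Identifies slow or stuck processing and visibility timeout problems | AWS/SQS | Average | 60s
NumberOfMessagesSent | Number of messages sent to the queue during the period | Tracks inbound traffic and spikes; input to the drain time calculation | AWS/SQS | Sum | 300s
NumberOfMessagesDeleted | Number of messages successfully deleted during the period | Represents the actual completion rate; shows whether consumers are working | AWS/SQS | Sum | 300s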

Drain time metric math:

[
  {
    "Id": "visible",
    "MetricStat": {
      "Metric": {
        "Namespace": "AWS/SQS",
        "MetricName": "ApproximateNumberOfMessagesVisible",
        "Dimensions": [{ "Name": "QueueName", "Value": "prod-job-queue" }]
      },
      "Period": 300,
      "Stat": "Average"
    },
    "ReturnData": false
  },
  {
    "Id": "deleted",
    "MetricStat": {
      "Metric": {
        "Namespace": "AWS/SQS",
        "MetricName": "NumberOfMessagesDeleted",
        "Dimensions": [{ "Name": "QueueName", "Value": "prod-job-queue" }]
      },
      "Period": 300,
      "Stat": "Sum"
    },
    "ReturnData": false
  },
  {
    "Id": "drain_seconds",
    "Expression": "IF(deleted>0,visible/(deleted/300),IF(visible>0,999999,0))",
    "Label": "Estimated drain time seconds",
    "ReturnData": true
  }
]
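
As a worked example of the arithmetic, the following sketch reproduces the drain time calculation in plain Python. The function name `estimated_drain_seconds` is hypothetical, and returning 0 for an empty, idle queue is an assumption of this sketch so that a queue with nothing to drain does not look unhealthy.

```python
def estimated_drain_seconds(visible, deleted_in_period, period_seconds=300):
    """Estimate the seconds needed to clear the backlog at the current delete rate.

    visible:            average ApproximateNumberOfMessagesVisible over the period
    deleted_in_period:  Sum of NumberOfMessagesDeleted over the period
    """
    if deleted_in_period > 0:
        delete_rate_per_second = deleted_in_period / period_seconds
        return visible / delete_rate_per_second
    # Nothing was deleted: a backlog can never drain, so use the sentinel value.
    # Assumption: an empty queue with no deletes counts as already drained.
    return 999999.0 if visible > 0 else 0.0


if __name__ == "__main__":
    # 3000 visible messages, 600 deleted in 300s -> 2 msg/s -> 1500s to drain.
    print(estimated_drain_seconds(3000, 600))
```

In this example the estimate of 1500 seconds exceeds the 900-second threshold, so the drain time alarm condition would be met for that datapoint.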

CloudWatch alarm conditions:

metric math id: drain_seconds
comparison: GreaterThanOrEqualToThreshold
threshold: 900
evaluation_periods: 2
datapoints_to_alarm: 2
period: 300
severity: P1
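
One way to turn these conditions into an API call is to build the keyword arguments for the CloudWatch `PutMetricAlarm` operation. The sketch below only assembles the arguments and does not call AWS; the helper name, the alarm name, the description text, and the `TreatMissingData` choice are all assumptions for illustration. The `metric_queries` argument is expected to be the metric math list shown earlier, loaded from JSON.

```python
def drain_time_alarm_kwargs(alarm_name, metric_queries, threshold=900.0):
    """Build keyword arguments for a drain-time alarm (hypothetical helper)."""
    # In this sketch every query must set ReturnData explicitly, and only the
    # drain_seconds expression may return data to the alarm.
    returned = [q["Id"] for q in metric_queries if q.get("ReturnData") is True]
    if returned != ["drain_seconds"]:
        raise ValueError("exactly one query, 'drain_seconds', must return data")
    return {
        "AlarmName": alarm_name,
        "AlarmDescription": "P1: estimated SQS drain time >= threshold",
        "ComparisonOperator": "GreaterThanOrEqualToThreshold",
        "Threshold": float(threshold),
        "EvaluationPeriods": 2,   # 2 x 300s periods = 10 minutes
        "DatapointsToAlarm": 2,
        # Assumption: missing datapoints are treated as healthy.
        "TreatMissingData": "notBreaching",
        "Metrics": list(metric_queries),
    }


if __name__ == "__main__":
    demo_queries = [
        {"Id": "visible", "ReturnData": False},
        {"Id": "deleted", "ReturnData": False},
        {"Id": "drain_seconds", "Expression": "visible", "ReturnData": True},
    ]
    kwargs = drain_time_alarm_kwargs("sqs-prod-job-queue-drain-time", demo_queries)
    print(kwargs["ComparisonOperator"], kwargs["Threshold"])
```

With boto3 installed and credentials configured, the result could be passed as `boto3.client("cloudwatch").put_metric_alarm(**kwargs)`. The period is not a top-level argument here because each `MetricStat` entry in the metric math list carries its own `Period`.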

3. PromQL

# First use a regex to confirm the SQS metric names that YACE exposes.
# This is not an alert; it only checks that labels such as queue_name match the current environment.
{__name__=~"aws_sqs_.*(oldest|visible|sent|deleted).*"}
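# The oldest message in the queue is at least 300 seconds old.
# At least one message has gone 5 minutes without being consumed successfully.
aws_sqs_approximate_age_of_oldest_message_maximum{queue_name="prod-job-queue"} >= 300
# DLQ visible message count >= 1.
# Messages in the DLQ usually mean consumption from the main queue failed and need manual or automated redrive.
aws_sqs_approximate_number_of_messages_visible_average{queue_name="prod-job-dlq"} >= 1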
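# Estimated queue drain time:
#   visible messages / delete rate per second
#   YACE exports the CloudWatch Sum statistic as a gauge, not a counter, so rate() does not apply.
#   The delete rate is the 5-minute average of that gauge divided by the CloudWatch period
#   (assumed to be 300s, as in the metrics table).
#   clamp_min(..., 0.01) avoids division by zero when the delete rate is 0.
# >= 900 means clearing the backlog takes 15 minutes or more at the current consumption rate.
aws_sqs_approximate_number_of_messages_visible_average{queue_name="prod-job-queue"}
/
clamp_min(avg_over_time(aws_sqs_number_of_messages_deleted_sum{queue_name="prod-job-queue"}[5m]) / 300, 0.01)
>= 900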
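# Consumer stalled:
#   messages entered the queue during the last 10 minutes
#   but no message was deleted during the last 10 minutes
# This usually means the workers have stopped, hit a permission error, are stuck in code, or the visibility timeout is misconfigured.
# The *_sum series are gauges, so sum_over_time() is used instead of increase().
sum_over_time(aws_sqs_number_of_messages_sent_sum{queue_name="prod-job-queue"}[10m]) > 0
and
sum_over_time(aws_sqs_number_of_messages_deleted_sum{queue_name="prod-job-queue"}[10m]) == 0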
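
The stall condition (sent > 0 and deleted == 0 over the window) can be checked offline against sample data before the alert is wired up. The sketch below is a plain-Python illustration; the function name `consumer_stalled` and the list-of-samples input format are assumptions. It does not query Prometheus.

```python
def consumer_stalled(sent_samples, deleted_samples):
    """Return True when messages arrived in the window but none were deleted.

    Each argument is a list of per-period Sum values (non-negative numbers)
    covering the same time window, for example the last 10 minutes.
    """
    if not sent_samples or not deleted_samples:
        # Assumption: with no data at all, do not report a stall.
        return False
    return sum(sent_samples) > 0 and sum(deleted_samples) == 0


if __name__ == "__main__":
    print(consumer_stalled([0, 12, 5], [0, 0, 0]))  # arrivals, no deletes
    print(consumer_stalled([0, 12, 5], [0, 3, 0]))  # consumers still working
    print(consumer_stalled([0, 0, 0], [0, 0, 0]))   # idle queue
```

Only the first case counts as a stall. An idle queue with no arrivals is deliberately not flagged, which matches the `sent > 0` half of the condition.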

4. vmalert Rules

groups:
  - name: sqs.rules
    rules:
      - alert: SQSOldestMessageTooOld
        # An oldest-message age of 300 seconds or more means queue processing delay has exceeded the default SLA.
        expr: aws_sqs_approximate_age_of_oldest_message_maximum >= 300
        for: 10m
        labels:
          severity: P1
          component: sqs
        annotations:
          summary: "SQS oldest message age is >= 300s"

      - alert: SQSDlqHasMessages
        # Any visible message in a DLQ should alert, because it means a consumption failure has already happened.
        expr: aws_sqs_approximate_number_of_messages_visible_average{queue_name=~".*dlq.*"} >= 1
        for: 1m
        labels:
          severity: P1
          component: sqs
        annotations:
          summary: "SQS DLQ has visible messages"