AWS S3 Monitoring


https://docs.aws.amazon.com/AmazonS3/latest/userguide/cloudwatch-monitoring.html
https://docs.aws.amazon.com/AmazonS3/latest/userguide/metrics-dimensions.html
https://docs.aws.amazon.com/AmazonS3/latest/userguide/enable-cloudwatch-metrics.html
https://prometheus.io/docs/prometheus/latest/configuration/alerting_rules/

1. Alert Standard

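| Severity | Alert | Meaning | Why Monitor | Definition | Duration |
|---|---|---|---|---|---|
| P1 | 5xx request rate high | S3 server-side error rate is high | Affects availability of object reads and writes; may require retries, degradation, or contacting AWS Support | 5xxErrors / AllRequests >= 1% and 5xxErrors >= 10 | 10m |
| P2 | 4xx request rate high | Client error rate is high | Surfaces permission, path, signature, missing-object, or release configuration problems | 4xxErrors / AllRequests >= 20% and 4xxErrors >= 100 | 15m |
| P2 | Storage growth abnormal | Storage growth clearly exceeds the historical baseline | Surfaces abnormal writes, log floods, broken lifecycle rules, and cost risk | daily growth > 2 * avg daily growth of last 7d | 1d |
| P1 | Public access opened | A public-access-related bucket setting was opened | Prevents data leaks and exposure through a wrong bucket policy | CloudTrail/EventBridge detects bucket policy or public access change event | - |
| P1 | Lifecycle removed | A lifecycle rule was deleted or modified | Prevents cost and compliance risk when objects can no longer expire or transition | CloudTrail/EventBridge detects lifecycle delete/change event | - |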
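
The two error-rate rows combine a percentage threshold with an absolute error floor. A minimal Python sketch of that gating logic follows; the class and function names are illustrative and not part of any AWS or Prometheus API, and the thresholds are copied from the table.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ErrorRateRule:
    """Thresholds for one error-rate row of the alert table (illustrative)."""
    min_rate_pct: float  # error share of AllRequests, in percent
    min_errors: int      # absolute floor that suppresses low-traffic noise


# Values copied from the P1 5xx row and the P2 4xx row.
RULE_5XX = ErrorRateRule(min_rate_pct=1.0, min_errors=10)
RULE_4XX = ErrorRateRule(min_rate_pct=20.0, min_errors=100)


def breaches(rule: ErrorRateRule, errors: int, all_requests: int) -> bool:
    """Return True only when both the rate and the absolute count are reached."""
    if all_requests <= 0:
        return False  # no traffic in the window, nothing to alert on
    rate_pct = 100.0 * errors / all_requests
    return rate_pct >= rule.min_rate_pct and errors >= rule.min_errors
```

For example, `breaches(RULE_5XX, 1, 2)` is false even though the rate is 50%, because a single error stays under the floor of 10.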

S3 request metrics are not enabled automatically for every bucket; they must be enabled per bucket or per prefix/filter.
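
One way to enable request metrics is the S3 `PutBucketMetricsConfiguration` API. The sketch below only builds the request arguments, so it runs without AWS access. It assumes a boto3 S3 client is passed in, and the helper names are hypothetical. The configuration `Id` is the value that later appears as the `FilterId` dimension.

```python
from typing import Any, Dict, Optional


def metrics_config_request(bucket: str, filter_id: str,
                           prefix: Optional[str] = None) -> Dict[str, Any]:
    """Build keyword arguments for S3 put_bucket_metrics_configuration."""
    config: Dict[str, Any] = {"Id": filter_id}
    if prefix is not None:
        # Without a Filter the configuration covers every object in the bucket.
        config["Filter"] = {"Prefix": prefix}
    return {"Bucket": bucket, "Id": filter_id, "MetricsConfiguration": config}


def enable_request_metrics(s3_client: Any, bucket: str, filter_id: str,
                           prefix: Optional[str] = None) -> None:
    """Apply the configuration; s3_client is assumed to be boto3.client('s3')."""
    s3_client.put_bucket_metrics_configuration(
        **metrics_config_request(bucket, filter_id, prefix)
    )


if __name__ == "__main__":
    import json
    # Print the request instead of calling AWS so the sketch runs offline.
    print(json.dumps(metrics_config_request("prod-assets", "EntireBucket"), indent=2))
```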

2. CloudWatch Metrics

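| Metric | Meaning | Why Monitor | Namespace | Statistic | Period |
|---|---|---|---|---|---|
| AllRequests | Total number of requests to the bucket/prefix | Traffic baseline and denominator for error rates | AWS/S3 | Sum | 300s |
| 4xxErrors | Number of client errors returned by S3 | Surfaces permission, signature, path, or missing-object problems | AWS/S3 | Sum | 300s |
| 5xxErrors | Number of server errors returned by S3 | Surfaces availability problems on the S3 side or in the request path | AWS/S3 | Sum | 300s |
| FirstByteLatency | Latency until S3 starts returning the response | Gauges first-byte performance and the read experience for users | AWS/S3 | p95 | 300s |
| TotalRequestLatency | Latency of the complete request | Gauges end-to-end object request time and performance regressions | AWS/S3 | p95 | 300s |
| BucketSizeBytes | Total bytes stored in the bucket | Tracks capacity growth, cost, and lifecycle effectiveness | AWS/S3 | Average | 86400s |
| NumberOfObjects | Number of objects in the bucket | Surfaces abnormal object-count growth and lifecycle/cleanup problems | AWS/S3 | Average | 86400s |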

5xx rate metric math:

[
  {
    "Id": "allreq",
    "MetricStat": {
      "Metric": {
        "Namespace": "AWS/S3",
        "MetricName": "AllRequests",
        "Dimensions": [
          { "Name": "BucketName", "Value": "prod-assets" },
          { "Name": "FilterId", "Value": "EntireBucket" }
        ]
      },
      "Period": 300,
      "Stat": "Sum"
    },
    "ReturnData": false
  },
  {
    "Id": "e5xx",
    "MetricStat": {
      "Metric": {
        "Namespace": "AWS/S3",
        "MetricName": "5xxErrors",
        "Dimensions": [
          { "Name": "BucketName", "Value": "prod-assets" },
          { "Name": "FilterId", "Value": "EntireBucket" }
        ]
      },
      "Period": 300,
      "Stat": "Sum"
    },
    "ReturnData": false
  },
  {
    "Id": "s3_5xx_rate",
    "Expression": "IF(allreq>0,100*e5xx/allreq,0)",
    "Label": "S3 5xx percent",
    "ReturnData": true
  }
]
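
The metric math above can back a CloudWatch alarm for the P1 5xx row. The following sketch assembles keyword arguments in the shape that `put_metric_alarm` accepts, without calling AWS. The alarm name, the `TreatMissingData` choice, and the helper names are assumptions, and the absolute floor of 10 errors from the alert table is not encoded here.

```python
from typing import Any, Dict, List


def s3_metric(metric_id: str, name: str, bucket: str, filter_id: str) -> Dict[str, Any]:
    """One MetricStat query for an S3 request metric, summed per 300 s."""
    return {
        "Id": metric_id,
        "MetricStat": {
            "Metric": {
                "Namespace": "AWS/S3",
                "MetricName": name,
                "Dimensions": [
                    {"Name": "BucketName", "Value": bucket},
                    {"Name": "FilterId", "Value": filter_id},
                ],
            },
            "Period": 300,
            "Stat": "Sum",
        },
        "ReturnData": False,
    }


def s3_5xx_alarm(bucket: str, filter_id: str = "EntireBucket") -> Dict[str, Any]:
    """Keyword arguments for CloudWatch put_metric_alarm (5xx rate >= 1%)."""
    metrics: List[Dict[str, Any]] = [
        s3_metric("allreq", "AllRequests", bucket, filter_id),
        s3_metric("e5xx", "5xxErrors", bucket, filter_id),
        {
            "Id": "s3_5xx_rate",
            "Expression": "IF(allreq>0,100*e5xx/allreq,0)",
            "Label": "S3 5xx percent",
            "ReturnData": True,  # only the expression drives the alarm
        },
    ]
    return {
        "AlarmName": f"s3-5xx-rate-high-{bucket}",  # hypothetical naming scheme
        "Metrics": metrics,
        "ComparisonOperator": "GreaterThanOrEqualToThreshold",
        "Threshold": 1.0,        # percent
        "EvaluationPeriods": 2,  # 2 x 300 s covers the 10m duration
        "DatapointsToAlarm": 2,
        "TreatMissingData": "notBreaching",  # assumption: an idle bucket is healthy
    }
```

With a boto3 CloudWatch client, the result could be passed as `cloudwatch.put_metric_alarm(**s3_5xx_alarm("prod-assets"))`, plus whatever alarm actions the team uses.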

3. PromQL

# First use a regex to confirm the S3 metric names that YACE exposes.
# S3 request metrics must first be enabled on the bucket or on a prefix/filter.
{__name__=~"aws_s3_.*(requests|errors|latency|bucket|objects).*"}
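
# 5xx error rate over 10 minutes.
# YACE exports each CloudWatch Sum datapoint as a gauge sample, not as a
# counter, so sum_over_time() is used here; increase() would misread the
# series as a counter with resets.
#   numerator   = 5xxErrors summed over 10 minutes
#   denominator = AllRequests summed over 10 minutes
#   clamp_min(..., 1) prevents division by zero when there are no requests
# Also require at least 10 5xx errors so that a single error under very low
# traffic does not fire the alert.
# Assumption: the scrape interval equals the CloudWatch period (300s);
# otherwise repeated samples inflate the absolute count.
(
  100 *
  sum_over_time(aws_s3_5xx_errors_sum{bucket_name="prod-assets"}[10m])
  /
  clamp_min(sum_over_time(aws_s3_all_requests_sum{bucket_name="prod-assets"}[10m]), 1)
  >= 1
)
and
sum_over_time(aws_s3_5xx_errors_sum{bucket_name="prod-assets"}[10m]) >= 10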

# Is the bucket's one-day growth abnormal?
#   current BucketSizeBytes - BucketSizeBytes 1 day ago = growth over the last day
#   avg_over_time(...[7d:1d]) = average daily growth over the past 7 days
# Growth above 2x the 7-day average counts as abnormal.
(
  aws_s3_bucket_size_bytes_average{bucket_name="prod-assets"}
  -
  aws_s3_bucket_size_bytes_average{bucket_name="prod-assets"} offset 1d
)
>
2 *
avg_over_time(
  (aws_s3_bucket_size_bytes_average{bucket_name="prod-assets"} - aws_s3_bucket_size_bytes_average{bucket_name="prod-assets"} offset 1d)[7d:1d]
)
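
The storage-growth rule can be checked by hand on a list of daily `BucketSizeBytes` samples. This is a simplified sketch with a hypothetical function name: it keeps the most recent day out of the baseline, whereas the subquery window above may overlap with it.

```python
from statistics import mean
from typing import Sequence


def growth_is_abnormal(daily_sizes: Sequence[float], factor: float = 2.0) -> bool:
    """daily_sizes: BucketSizeBytes samples, one per day, oldest first.

    Nine samples yield eight day-over-day deltas: the latest growth plus
    the seven earlier deltas that form the baseline.
    """
    if len(daily_sizes) < 9:
        raise ValueError("need at least 9 daily samples")
    deltas = [b - a for a, b in zip(daily_sizes, daily_sizes[1:])]
    latest = deltas[-1]
    baseline = mean(deltas[-8:-1])  # the 7 deltas before the latest one
    return latest > factor * baseline
```

A bucket that grows by 10 units per day and then by 30 in the latest day is flagged, since 30 exceeds 2 * 10; growth of exactly 20 is not, because the comparison is strict.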

4. EventBridge Security Rules

Detect public access, bucket policy, and lifecycle changes from CloudTrail events rather than inferring them from metrics:

{
  "source": ["aws.s3"],
  "detail-type": ["AWS API Call via CloudTrail"],
  "detail": {
    "eventSource": ["s3.amazonaws.com"],
    "eventName": [
      "PutBucketPolicy",
      "DeleteBucketPolicy",
      "PutBucketPublicAccessBlock",
      "DeleteBucketPublicAccessBlock",
      "PutBucketAcl",
      "PutBucketLifecycle",
      "PutBucketLifecycleConfiguration",
      "DeleteBucketLifecycle"
    ]
  }
}
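
Before deploying, it can help to test a pattern locally against a sample event. The matcher below is a minimal sketch that implements only the exact-value subset of EventBridge matching (lists mean "any of", nested objects recurse); it is not the EventBridge engine. The pattern is shortened and the sample event is made up for illustration.

```python
from typing import Any, Mapping

PATTERN = {
    "source": ["aws.s3"],
    "detail-type": ["AWS API Call via CloudTrail"],
    "detail": {
        "eventSource": ["s3.amazonaws.com"],
        "eventName": ["PutBucketPolicy", "DeleteBucketPolicy", "PutBucketAcl"],
    },
}


def matches(pattern: Mapping[str, Any], event: Mapping[str, Any]) -> bool:
    """Return True when every pattern key is present in the event and matches."""
    for key, expected in pattern.items():
        if key not in event:
            return False
        actual = event[key]
        if isinstance(expected, Mapping):
            # Nested pattern: the event value must be an object that matches too.
            if not isinstance(actual, Mapping) or not matches(expected, actual):
                return False
        elif actual not in expected:
            # Leaf pattern: a list of allowed values.
            return False
    return True


# Made-up sample event; extra fields in the event are ignored by the matcher.
SAMPLE_EVENT = {
    "source": "aws.s3",
    "detail-type": "AWS API Call via CloudTrail",
    "detail": {
        "eventSource": "s3.amazonaws.com",
        "eventName": "PutBucketPolicy",
        "requestParameters": {"bucketName": "prod-assets"},
    },
}
```

`matches(PATTERN, SAMPLE_EVENT)` is true, while the same event with `eventName` set to `GetObject` does not match.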