Links#
https://docs.aws.amazon.com/AmazonS3/latest/userguide/cloudwatch-monitoring.html
https://docs.aws.amazon.com/AmazonS3/latest/userguide/metrics-dimensions.html
https://docs.aws.amazon.com/AmazonS3/latest/userguide/enable-cloudwatch-metrics.html
https://prometheus.io/docs/prometheus/latest/configuration/alerting_rules/
1. Alert Standard#
| Severity | Alert | Meaning | Why Monitor | Definition | Duration |
|---|---|---|---|---|---|
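| P1 | 5xx request rate high | High S3 server-side error rate | Affects object read/write availability; may require retries, degradation, or contacting AWS Support | 5xxErrors / AllRequests >= 1% and 5xxErrors >= 10 | 10m |
| P2 | 4xx request rate high | High client error rate | Surfaces permission, path, signature, missing-object, or release-configuration problems | 4xxErrors / AllRequests >= 20% and 4xxErrors >= 100 | 15m |
| P2 | Storage growth abnormal | Storage growth clearly exceeds the historical baseline | Surfaces abnormal writes, log blowups, broken lifecycle rules, and cost risk | daily growth > 2 * avg daily growth of last 7d | 1d |
| P1 | Public access opened | Bucket public-access settings were opened up | Prevents data leaks and exposure through a misconfigured bucket policy | CloudTrail/EventBridge detects bucket policy or public access change | event |
| P1 | Lifecycle removed | Lifecycle rules were deleted or modified | Prevents cost and compliance risk from objects that no longer expire or transition | CloudTrail/EventBridge detects lifecycle delete/change | event |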

S3 request metrics are not enabled automatically for every bucket; they must be enabled per bucket or per prefix/filter.
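
The Definition column above can be read as two small predicates. The following is a minimal Python sketch of that logic, not the alerting implementation itself: the function names and parameters are hypothetical, and the storage check assumes that "avg daily growth of last 7d" means the seven days before the current one.

```python
from typing import Sequence


def error_rate_alert(errors: float, total: float,
                     min_rate_pct: float, min_errors: float) -> bool:
    """True when both the error share and the absolute error count cross their thresholds."""
    if total <= 0:
        return False  # no traffic in the window, so there is no rate to evaluate
    rate_pct = 100.0 * errors / total
    return rate_pct >= min_rate_pct and errors >= min_errors


def storage_growth_alert(daily_sizes: Sequence[float], factor: float = 2.0) -> bool:
    """daily_sizes: one BucketSizeBytes sample per day, oldest first (at least 9 values).

    Assumption: the baseline is the mean growth of the 7 days before the latest day.
    """
    if len(daily_sizes) < 9:
        raise ValueError("need at least 9 daily samples")
    # Day-over-day differences; the last entry is the most recent day's growth.
    growth = [later - earlier for earlier, later in zip(daily_sizes, daily_sizes[1:])]
    today = growth[-1]
    baseline = sum(growth[-8:-1]) / 7
    return today > factor * baseline
```

With the P1 thresholds from the table, `error_rate_alert(errors, total, 1.0, 10)` fires only when the 5xx share is at least 1% and at least 10 errors occurred, which is why a single failure on a nearly idle bucket does not page anyone.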
2. CloudWatch Metrics#
| Metric | Meaning | Why Monitor | Namespace | Statistic | Period |
|---|---|---|---|---|---|
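| AllRequests | Total requests to the bucket/prefix | Traffic baseline and denominator for error rates | AWS/S3 | Sum | 300s |
| 4xxErrors | Client errors returned by S3 | Surfaces permission, signature, path, or missing-object problems | AWS/S3 | Sum | 300s |
| 5xxErrors | Server errors returned by S3 | Surfaces availability problems on the S3 side or along the request path | AWS/S3 | Sum | 300s |
| FirstByteLatency | Latency until S3 starts returning the response | Indicates S3 first-byte performance and read experience | AWS/S3 | p95 | 300s |
| TotalRequestLatency | Full request-processing latency in S3 | Indicates end-to-end object request time and performance regressions | AWS/S3 | p95 | 300s |
| BucketSizeBytes | Total bytes stored in the bucket | Tracks capacity growth, cost, and lifecycle effectiveness | AWS/S3 | Average | 86400s |
| NumberOfObjects | Number of objects in the bucket | Surfaces abnormal object-count growth and lifecycle/cleanup problems | AWS/S3 | Average | 86400s |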
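
The statistic and period choices in the table can be kept as data and turned into CloudWatch metric queries. This is a rough Python sketch under stated assumptions: `METRIC_SETTINGS` and `build_query` are hypothetical names, and the caller supplies the dimensions (request metrics use `BucketName` plus `FilterId`, while the daily storage metrics use `BucketName` plus `StorageType`).

```python
from typing import Any, Dict

# Statistic and period per metric, copied from the table above.
METRIC_SETTINGS: Dict[str, Dict[str, Any]] = {
    "AllRequests": {"Stat": "Sum", "Period": 300},
    "4xxErrors": {"Stat": "Sum", "Period": 300},
    "5xxErrors": {"Stat": "Sum", "Period": 300},
    "FirstByteLatency": {"Stat": "p95", "Period": 300},
    "TotalRequestLatency": {"Stat": "p95", "Period": 300},
    "BucketSizeBytes": {"Stat": "Average", "Period": 86400},
    "NumberOfObjects": {"Stat": "Average", "Period": 86400},
}


def build_query(metric_name: str, dimensions: Dict[str, str]) -> Dict[str, Any]:
    """Build one metric query dict for the given S3 metric."""
    settings = METRIC_SETTINGS[metric_name]
    return {
        # Query ids must start with a lowercase letter, hence the "m_" prefix.
        "Id": "m_" + metric_name.lower(),
        "MetricStat": {
            "Metric": {
                "Namespace": "AWS/S3",
                "MetricName": metric_name,
                "Dimensions": [
                    {"Name": name, "Value": value}
                    for name, value in dimensions.items()
                ],
            },
            "Period": settings["Period"],
            "Stat": settings["Stat"],
        },
        "ReturnData": True,
    }
```

A list of such dicts can be passed as `MetricDataQueries` to boto3's `get_metric_data` on a CloudWatch client, together with `StartTime` and `EndTime`.
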
5xx rate metric math:
[
  {
    "Id": "allreq",
    "MetricStat": {
      "Metric": {
        "Namespace": "AWS/S3",
        "MetricName": "AllRequests",
        "Dimensions": [
          { "Name": "BucketName", "Value": "prod-assets" },
          { "Name": "FilterId", "Value": "EntireBucket" }
        ]
      },
      "Period": 300,
      "Stat": "Sum"
    },
    "ReturnData": false
  },
  {
    "Id": "e5xx",
    "MetricStat": {
      "Metric": {
        "Namespace": "AWS/S3",
        "MetricName": "5xxErrors",
        "Dimensions": [
          { "Name": "BucketName", "Value": "prod-assets" },
          { "Name": "FilterId", "Value": "EntireBucket" }
        ]
      },
      "Period": 300,
      "Stat": "Sum"
    },
    "ReturnData": false
  },
  {
    "Id": "s3_5xx_rate",
    "Expression": "IF(allreq>0,100*e5xx/allreq,0)",
    "Label": "S3 5xx percent",
    "ReturnData": true
  }
]
3. PromQL#
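# Use a regex first to confirm the S3 metric names that YACE exposes.
# S3 request metrics must first be enabled on the bucket or prefix/filter.
{__name__=~"aws_s3_.*(requests|errors|latency|bucket|objects).*"}
# 10-minute S3 5xx error rate:
# numerator = increase in 5xxErrors over 10 minutes
# denominator = increase in AllRequests over 10 minutes
# clamp_min(..., 1) avoids division by zero when there are no requests
# Also require at least 10 5xx errors so that a single error under very low traffic does not fire.
# Note: increase() assumes counter-style series; if the exporter exposes each period's Sum as a gauge, sum_over_time() is the closer fit.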
100 *
increase(aws_s3_5xx_errors_sum{bucket_name="prod-assets"}[10m])
/
clamp_min(increase(aws_s3_all_requests_sum{bucket_name="prod-assets"}[10m]), 1)
>= 1
and
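increase(aws_s3_5xx_errors_sum{bucket_name="prod-assets"}[10m]) >= 10
# Check whether the bucket's one-day growth is abnormal:
# current BucketSizeBytes - BucketSizeBytes 1 day ago = today's growth
# avg_over_time(...[7d:1d]) = average daily growth over the past 7 days
# Growth is treated as abnormal when it exceeds 2x the 7-day average.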
(
aws_s3_bucket_size_bytes_average{bucket_name="prod-assets"}
-
aws_s3_bucket_size_bytes_average{bucket_name="prod-assets"} offset 1d
)
>
2 *
avg_over_time(
(aws_s3_bucket_size_bytes_average{bucket_name="prod-assets"} - aws_s3_bucket_size_bytes_average{bucket_name="prod-assets"} offset 1d)[7d:1d]
)
4. EventBridge Security Rules#
Detect public access / bucket policy changes from CloudTrail events rather than inferring them from metrics:
{
  "source": ["aws.s3"],
  "detail-type": ["AWS API Call via CloudTrail"],
  "detail": {
    "eventSource": ["s3.amazonaws.com"],
    "eventName": [
      "PutBucketPolicy",
      "DeleteBucketPolicy",
      "PutBucketPublicAccessBlock",
      "DeleteBucketPublicAccessBlock",
      "PutBucketAcl",
      "PutBucketLifecycleConfiguration",
      "DeleteBucketLifecycle"
    ]
  }
}
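
To sanity-check a pattern like this before deploying it, the matching can be approximated locally. The sketch below is a deliberately simplified Python matcher, not EventBridge's actual engine: it handles only exact-value lists and nested objects (no prefix, numeric, or `anything-but` operators). `matches` and `EXAMPLE_PATTERN` are hypothetical names, and the pattern is shortened for illustration.

```python
from typing import Any, Dict

# Shortened illustration of the rule above; not the full event list.
EXAMPLE_PATTERN: Dict[str, Any] = {
    "source": ["aws.s3"],
    "detail-type": ["AWS API Call via CloudTrail"],
    "detail": {
        "eventSource": ["s3.amazonaws.com"],
        "eventName": ["PutBucketPolicy", "DeleteBucketLifecycle"],
    },
}


def matches(pattern: Dict[str, Any], event: Dict[str, Any]) -> bool:
    """Return True if every pattern field is present in the event with an allowed value."""
    for key, expected in pattern.items():
        if key not in event:
            return False
        actual = event[key]
        if isinstance(expected, dict):
            # Nested object: recurse into the matching sub-object of the event.
            if not isinstance(actual, dict) or not matches(expected, actual):
                return False
        elif actual not in expected:
            # Leaf: the event value must equal one of the listed values.
            return False
    return True
```

Feeding it a recorded CloudTrail event (for example, one captured from a test bucket) shows whether the rule would fire, and fields that the pattern does not mention are ignored, as they are in EventBridge.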