Links#
https://docs.aws.amazon.com/secretsmanager/latest/userguide/monitoring.html
https://docs.aws.amazon.com/secretsmanager/latest/userguide/monitoring-eventbridge.html
https://docs.aws.amazon.com/eventbridge/latest/userguide/eb-event-patterns.html
https://prometheus.io/docs/prometheus/latest/configuration/alerting_rules/
1. Alert Standard#
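| Severity | Alert | Meaning | Why Monitor | Definition | Duration |
|---|---|---|---|---|---|
| P1 | Rotation failed | Automatic rotation of a secret failed | A failed rotation leaves credentials aging, which increases leak and compliance risk | EventBridge receives RotationFailed / failed rotation event | event |
| P1 | Secret scheduled deletion | A secret is scheduled for deletion | Prevents an accidental deletion of a critical credential from making the application unavailable | DeleteSecret with recovery window | event |
| P1 | Secret policy changed | The secret resource policy was modified or deleted | Detects an access boundary that was widened, locked out, or changed by mistake | PutResourcePolicy / DeleteResourcePolicy | event |
| P1 | App cannot read secret | The application failed to read a secret | Directly affects application startup, database connections, and calls to dependencies | app metric secret_read_total{result="error"} increased | 5m |
| P2 | Secret rotation due soon | A secret is close to its rotation due time but has not been rotated yet | Detects early that the rotation process did not run or that configuration is missing | next_rotation_time - now < 7d | 1h |
| P2 | Access denied spike | AccessDenied errors when reading secrets increased | Detects a broken IAM/KMS/resource policy or abnormal access | AccessDeniedException >= 5 from logs/app metric | 10m |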
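Secrets Manager has no typical resource-utilization (watermark) alerts of its own; monitoring centers on events and on application access failures.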
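The event-driven rows of the table can be encoded as a small lookup from CloudTrail event name to alert name and severity. The sketch below is one possible shape for that lookup, not a definitive implementation. It assumes the forwarder receives the EventBridge event with `time` and `detail.eventName` fields and emits objects in the shape accepted by Alertmanager's v2 alerts endpoint. The `alertname` values, the `event_name` label, and the function name `toAlert` are hypothetical; the `component` label reuses the value from the vmalert rules in section 4.

```typescript
// One row of the alert table: display name, hypothetical alertname, severity.
interface AlertRule {
  alertname: string; // hypothetical naming, not defined in the table
  summary: string;   // the "Alert" column of the table
  severity: "P1" | "P2";
}

// Event-driven rows of the table, keyed by CloudTrail eventName.
const EVENT_RULES = new Map<string, AlertRule>([
  ["RotationFailed", { alertname: "SecretRotationFailed", summary: "Rotation failed", severity: "P1" }],
  ["DeleteSecret", { alertname: "SecretScheduledDeletion", summary: "Secret scheduled deletion", severity: "P1" }],
  ["PutResourcePolicy", { alertname: "SecretPolicyChanged", summary: "Secret policy changed", severity: "P1" }],
  ["DeleteResourcePolicy", { alertname: "SecretPolicyChanged", summary: "Secret policy changed", severity: "P1" }],
]);

// Assumed subset of the EventBridge event that this sketch reads.
interface SecretsManagerEvent {
  time: string; // ISO 8601 timestamp of the event
  detail: { eventName: string };
}

// Alert object with the fields Alertmanager uses for routing and display.
interface AlertmanagerAlert {
  labels: Record<string, string>;
  annotations: Record<string, string>;
  startsAt: string;
}

// Returns an alert for events listed in the table, or null for anything else.
function toAlert(event: SecretsManagerEvent): AlertmanagerAlert | null {
  const rule = EVENT_RULES.get(event.detail.eventName);
  if (rule === undefined) {
    return null;
  }
  return {
    labels: {
      alertname: rule.alertname,
      severity: rule.severity,
      component: "secrets-manager",
      event_name: event.detail.eventName,
    },
    annotations: { summary: rule.summary },
    startsAt: event.time,
  };
}
```

Returning `null` for unlisted events keeps informational API calls from paging anyone; whether to alert on them at a lower severity is a policy choice the table does not make.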
2. EventBridge Rules#
Deletion, policy changes, and rotation failures:
{
  "source": ["aws.secretsmanager"],
  "detail-type": [
    "AWS API Call via CloudTrail",
    "AWS Service Event via CloudTrail"
  ],
  "detail": {
    "eventSource": ["secretsmanager.amazonaws.com"],
    "eventName": [
      "DeleteSecret",
      "PutResourcePolicy",
      "DeleteResourcePolicy",
      "RotateSecret",
      "CancelRotateSecret",
      "UpdateSecretVersionStage",
      "RotationFailed"
    ]
  }
}
It is recommended to send the EventBridge target to SNS and then use a Lambda to convert the events into Alertmanager alerts:
EventBridge -> SNS -> Lambda -> Alertmanager / direct IM fallback
3. Application Metrics#
The Node.js application exposes these metrics when it reads a secret:
secret_read_total{service="api",secret_name="prod/db/password",result="success"}
secret_read_total{service="api",secret_name="prod/db/password",result="error",error="AccessDeniedException"}
secret_cache_hit_total{service="api",secret_name="prod/db/password"}
secret_next_rotation_epoch_seconds{secret_name="prod/db/password"}
PromQL:
# Application secret reads failed in the last 5 minutes.
# This metric comes from application instrumentation; it is not a native Secrets Manager CloudWatch metric.
increase(secret_read_total{result="error"}[5m]) > 0

# At least 5 AccessDeniedException errors in the last 10 minutes.
# Common causes are a broken IAM policy, KMS key policy, or secret resource policy.
increase(secret_read_total{error="AccessDeniedException"}[10m]) >= 5

# Days remaining until the secret's next rotation:
# secret_next_rotation_epoch_seconds is the Unix time of the next rotation
# time() is the current Unix time
# dividing the difference by 86400 converts seconds to days
# < 7 means rotation is due within 7 days.
(secret_next_rotation_epoch_seconds - time()) / 86400 < 7
4. vmalert Rules#
groups:
  - name: secrets-manager.rules
    rules:
      - alert: SecretReadErrors
        # The number of application secret read errors increased within 5 minutes.
        expr: increase(secret_read_total{result="error"}[5m]) > 0
        for: 5m
        labels:
          severity: P1
          component: secrets-manager
        annotations:
          summary: "Application failed to read secret"
      - alert: SecretRotationDueSoon
        # The next rotation is less than 7 days from now.
        expr: (secret_next_rotation_epoch_seconds - time()) / 86400 < 7
        for: 1h
        labels:
          severity: P2
          component: secrets-manager
        annotations:
          summary: "Secret rotation is due in less than 7 days"
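
These rules only fire if the application actually exports `secret_read_total` with the `result` and `error` labels shown in section 3. The following is a minimal, dependency-free sketch of how a Node.js service might record those labels around a secret read. `LabeledCounter`, `readSecretInstrumented`, and the `fetchSecret` callback are hypothetical names introduced for illustration; a real service would more likely use a metrics library, and the sketch assumes the thrown error's `name` carries the AWS error code (for example `AccessDeniedException`).

```typescript
// Minimal in-memory counter keyed by label set; a stand-in for a metrics library.
class LabeledCounter {
  private readonly name: string;
  private readonly values = new Map<string, number>();

  constructor(name: string) {
    this.name = name;
  }

  // Increment the series identified by this exact label set.
  inc(labels: Record<string, string>): void {
    const key = this.render(labels);
    this.values.set(key, (this.values.get(key) ?? 0) + 1);
  }

  // One Prometheus text exposition line per series.
  expose(): string[] {
    return Array.from(this.values.entries()).map(([key, v]) => `${key} ${v}`);
  }

  // Label value escaping is omitted for brevity.
  private render(labels: Record<string, string>): string {
    const body = Object.entries(labels)
      .map(([k, v]) => `${k}="${v}"`)
      .join(",");
    return `${this.name}{${body}}`;
  }
}

const secretReadTotal = new LabeledCounter("secret_read_total");

// Wraps any secret fetcher and records success or error with the error name.
async function readSecretInstrumented(
  service: string,
  secretName: string,
  fetchSecret: (name: string) => Promise<string>, // hypothetical fetcher, e.g. a Secrets Manager client call
): Promise<string> {
  try {
    const value = await fetchSecret(secretName);
    secretReadTotal.inc({ service, secret_name: secretName, result: "success" });
    return value;
  } catch (err) {
    // Assumption: err.name holds the AWS error code.
    const error = err instanceof Error ? err.name : "Unknown";
    secretReadTotal.inc({ service, secret_name: secretName, result: "error", error });
    throw err; // the caller still sees the original failure
  }
}
```

Rethrowing after counting keeps the wrapper transparent to callers, so adding the metric does not change the application's error handling.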