AWS Secrets Manager Monitoring


https://docs.aws.amazon.com/secretsmanager/latest/userguide/monitoring.html
https://docs.aws.amazon.com/secretsmanager/latest/userguide/monitoring-eventbridge.html
https://docs.aws.amazon.com/eventbridge/latest/userguide/eb-event-patterns.html
https://prometheus.io/docs/prometheus/latest/configuration/alerting_rules/

1. Alert Standard

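| Severity | Alert | Meaning | Why Monitor | Definition | Duration |
|---|---|---|---|---|---|
| P1 | Rotation failed | Automatic secret rotation failed | A failed rotation leaves credentials aging, which raises leak and compliance risk | EventBridge receives a RotationFailed event | event |
| P1 | Secret scheduled deletion | A secret was scheduled for deletion | Prevents a critical credential from being deleted by mistake and taking the application down | DeleteSecret with recovery window | event |
| P1 | Secret policy changed | The secret's resource policy was modified or deleted | Detects an access boundary that was loosened, locked out, or changed by mistake | PutResourcePolicy / DeleteResourcePolicy | event |
| P1 | App cannot read secret | The application failed to read a secret | Directly affects application startup, database connections, and calls to dependencies | app metric secret_read_total{result="error"} increased | 5m |
| P2 | Secret not rotated soon | A secret is close to its due rotation time but has not been rotated | Gives early warning that the rotation process did not run or is misconfigured | next_rotation_time - now < 7d | 1h |
| P2 | Access denied spike | AccessDenied errors when reading secrets increased | Detects a broken IAM/KMS/resource policy or abnormal access | AccessDeniedException >= 5 from logs/app metric | 10m |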

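Secrets Manager is not a resource you typically alert on by capacity or utilization; the signals that matter are events and application access failures.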

2. EventBridge Rules

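Deletion, policy changes, and rotation failures. RotationFailed is emitted by the service itself rather than by an API call, so it arrives with the detail-type "AWS Service Event via CloudTrail" and must be listed alongside the API call names:

{
  "source": ["aws.secretsmanager"],
  "detail-type": [
    "AWS API Call via CloudTrail",
    "AWS Service Event via CloudTrail"
  ],
  "detail": {
    "eventSource": ["secretsmanager.amazonaws.com"],
    "eventName": [
      "DeleteSecret",
      "PutResourcePolicy",
      "DeleteResourcePolicy",
      "RotateSecret",
      "CancelRotateSecret",
      "UpdateSecretVersionStage",
      "RotationFailed"
    ]
  }
}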
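To make the link between matched events and the alert table concrete, here is a minimal TypeScript sketch of how a consumer might map an event to an alert. The names `classifyEvent` and `ALERT_BY_EVENT_NAME` are hypothetical. Treating events that are matched by the rule but absent from the table (for example RotateSecret) as non-paging is an assumption, not something the rule itself enforces.

```typescript
// Minimal shape of the fields this sketch reads from an EventBridge event.
interface SecretsManagerEvent {
  "detail-type": string;
  detail: { eventSource: string; eventName: string };
}

interface AlertInfo {
  alert: string;
  severity: "P1" | "P2";
}

// Hypothetical mapping from CloudTrail eventName to the event-driven
// alerts listed in the alert table of section 1.
const ALERT_BY_EVENT_NAME = new Map<string, AlertInfo>([
  ["RotationFailed", { alert: "Rotation failed", severity: "P1" }],
  ["DeleteSecret", { alert: "Secret scheduled deletion", severity: "P1" }],
  ["PutResourcePolicy", { alert: "Secret policy changed", severity: "P1" }],
  ["DeleteResourcePolicy", { alert: "Secret policy changed", severity: "P1" }],
]);

// Returns the alert for an event, or null when the event should only be
// logged (assumption: names without a table entry do not page anyone).
function classifyEvent(event: SecretsManagerEvent): AlertInfo | null {
  if (event.detail.eventSource !== "secretsmanager.amazonaws.com") {
    return null;
  }
  return ALERT_BY_EVENT_NAME.get(event.detail.eventName) ?? null;
}
```

Keeping the mapping in one table means a new event name can be promoted to an alert without touching the matching logic.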

Recommended: send the EventBridge target to SNS, then have a Lambda convert each message into an Alertmanager alert:

EventBridge -> SNS -> Lambda -> Alertmanager / direct IM fallback
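The Lambda step in this pipeline could look roughly like the sketch below, which converts an SNS-delivered EventBridge event into the alert objects that Alertmanager's v2 API accepts. The function name `toAlertmanagerAlerts`, the chosen labels, and the fixed P1 severity are assumptions for illustration. The location of the secret ID (`requestParameters.secretId` for API calls, `additionalEventData.SecretId` for service events) is also an assumption to verify against real events.

```typescript
// Subset of the SNS-to-Lambda event shape used by this sketch.
interface SnsLambdaEvent {
  Records: { Sns: { Message: string } }[];
}

// Fields of an alert object as accepted by Alertmanager's v2 alerts API.
interface AlertmanagerAlert {
  labels: Record<string, string>;
  annotations: Record<string, string>;
  startsAt: string;
}

function toAlertmanagerAlerts(event: SnsLambdaEvent): AlertmanagerAlert[] {
  return event.Records.map((record) => {
    // SNS delivers the EventBridge event as a JSON string in Sns.Message.
    const eb = JSON.parse(record.Sns.Message);
    const detail = eb.detail ?? {};
    const eventName = String(detail.eventName ?? "Event");
    // Assumption: the secret ID sits in one of these two places.
    const secretId =
      detail.requestParameters?.secretId ??
      detail.additionalEventData?.SecretId ??
      "unknown";
    return {
      labels: {
        alertname: `SecretsManager${eventName}`,
        // Assumption: only P1 events from the alert table are forwarded.
        severity: "P1",
        component: "secrets-manager",
        region: String(eb.region ?? "unknown"),
      },
      annotations: {
        summary: `Secrets Manager event ${eventName}`,
        secret_id: String(secretId),
      },
      startsAt: String(eb.time ?? new Date().toISOString()),
    };
  });
}
```

A handler would then POST the returned array as JSON to Alertmanager's `/api/v2/alerts` endpoint, and fall back to the direct IM channel if that request fails.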

3. Application Metrics

The Node.js application exposes these metrics when it reads a secret:

secret_read_total{service="api",secret_name="prod/db/password",result="success"}
secret_read_total{service="api",secret_name="prod/db/password",result="error",error="AccessDeniedException"}
secret_cache_hit_total{service="api",secret_name="prod/db/password"}
secret_next_rotation_epoch_seconds{secret_name="prod/db/password"}
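As a rough illustration of how the read counter could be produced, the sketch below wraps an arbitrary secret-fetching function and counts outcomes. It uses a tiny in-memory counter as a stand-in for a real metrics library. `incSecretRead`, `readSecretInstrumented`, and `renderMetrics` are hypothetical names, and taking the `error` label from the thrown error's `name` is an assumption about how the client reports failures.

```typescript
// Rendered label string -> count. Stand-in for a real metrics library.
const secretReadTotal = new Map<string, number>();

function incSecretRead(labels: Record<string, string>): void {
  const key = Object.keys(labels)
    .map((name) => `${name}="${labels[name]}"`)
    .join(",");
  secretReadTotal.set(key, (secretReadTotal.get(key) ?? 0) + 1);
}

// Wraps any secret-fetching function and records success or failure.
async function readSecretInstrumented(
  service: string,
  secretName: string,
  fetchSecret: (name: string) => Promise<string>,
): Promise<string> {
  const base = { service, secret_name: secretName };
  try {
    const value = await fetchSecret(secretName);
    incSecretRead({ ...base, result: "success" });
    return value;
  } catch (err) {
    // Assumption: the error's name carries the exception type,
    // for example "AccessDeniedException".
    const error = err instanceof Error ? err.name : "Unknown";
    incSecretRead({ ...base, result: "error", error });
    throw err;
  }
}

// Renders the counter in Prometheus text exposition format.
function renderMetrics(): string {
  const lines = ["# TYPE secret_read_total counter"];
  secretReadTotal.forEach((count, labels) => {
    lines.push(`secret_read_total{${labels}} ${count}`);
  });
  return lines.join("\n") + "\n";
}
```

Re-throwing after counting keeps the caller's error handling unchanged, so the instrumentation only observes failures and never hides them.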

PromQL:

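# The application failed to read a secret in the last 5 minutes.
# This metric comes from application instrumentation; it is not a native Secrets Manager CloudWatch capacity metric.
increase(secret_read_total{result="error"}[5m]) > 0
# At least 5 AccessDeniedException errors in the last 10 minutes.
# Common causes are a broken IAM policy, KMS key policy, or secret resource policy.
increase(secret_read_total{error="AccessDeniedException"}[10m]) >= 5
# Days remaining until the secret's next rotation:
#   secret_next_rotation_epoch_seconds is the Unix time of the next rotation
#   time() is the current Unix time
#   dividing the difference by 86400 converts seconds to days
# < 7 means the secret must be rotated within 7 days.
(secret_next_rotation_epoch_seconds - time()) / 86400 < 7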
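The arithmetic in the last expression can be checked with a small TypeScript sketch. The function names are hypothetical, and it assumes the gauge holds a Unix timestamp in seconds, for example one derived from the NextRotationDate field that DescribeSecret returns.

```typescript
const SECONDS_PER_DAY = 86400;

// Days until the next rotation; a negative result means it is overdue.
function daysUntilRotation(
  nextRotationEpochSeconds: number,
  nowEpochSeconds: number,
): number {
  return (nextRotationEpochSeconds - nowEpochSeconds) / SECONDS_PER_DAY;
}

// Mirrors the PromQL condition: fewer than thresholdDays days remain.
function rotationDueSoon(
  nextRotationEpochSeconds: number,
  nowEpochSeconds: number,
  thresholdDays: number = 7,
): boolean {
  return daysUntilRotation(nextRotationEpochSeconds, nowEpochSeconds) < thresholdDays;
}
```

With a next rotation 5 days (432000 seconds) away, `daysUntilRotation` returns 5 and the condition holds; at 10 days it does not. An overdue rotation yields a negative value, so it also satisfies the condition.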

4. vmalert Rules

groups:
  - name: secrets-manager.rules
    rules:
      - alert: SecretReadErrors
        # The application's secret read error count increased within 5 minutes.
        expr: increase(secret_read_total{result="error"}[5m]) > 0
        for: 5m
        labels:
          severity: P1
          component: secrets-manager
        annotations:
          summary: "Application failed to read secret"

      - alert: SecretRotationDueSoon
        # The next rotation is less than 7 days from now.
        expr: (secret_next_rotation_epoch_seconds - time()) / 86400 < 7
        for: 1h
        labels:
          severity: P2
          component: secrets-manager
        annotations:
          summary: "Secret rotation is due in less than 7 days"