AWS CloudFront Monitoring


https://docs.aws.amazon.com/AmazonCloudFront/latest/DeveloperGuide/monitoring-using-cloudwatch.html
https://docs.aws.amazon.com/AmazonCloudFront/latest/DeveloperGuide/viewing-cloudfront-metrics.html
https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/using-metric-math.html
https://prometheus.io/docs/prometheus/latest/configuration/alerting_rules/

1. Alert Standard

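| Severity | Alert | Meaning | Why Monitor | Definition | Duration |
|---|---|---|---|---|---|
| P1 | 5xx high | CloudFront is returning too high a share of 5xx responses | Directly affects user access; usually points to an origin, network, or CDN configuration problem | 5xxErrorRate >= 5% | 5m |
| P1 | Total errors high | Combined 4xx + 5xx error rate is too high | Catches failed user requests and broken cache rules or permission settings | TotalErrorRate >= 10% | 5m |
| P1 | Origin latency high | Responses from the origin to CloudFront are slow | Catches a slow origin, an abnormal network path, or an overloaded origin | OriginLatency p95 >= 2s | 10m |
| P2 | Cache hit low | Cache hit rate for cacheable content is low | A low hit rate raises origin load, cost, and user latency | CacheHitRate < 70% for cacheable content | 30m |
| P2 | Requests drop | Request volume is well below the historical baseline | Catches traffic anomalies caused by DNS, certificate, routing, or frontend release problems | current_5m < 50% of 1h average | 15m |
| P2 | TLS expiring | Certificate is too close to expiry | Prevents HTTPS failures caused by an expired certificate | < 14 days from blackbox exporter | 1h |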
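
The "Requests drop" definition compares the newest 5-minute request count with the average 5-minute count over the previous hour. A minimal Python sketch of that arithmetic follows. It assumes the input is a list of 5-minute Requests sums ordered oldest to newest. The function name and parameters are hypothetical, and excluding the current bucket from the 1h average is an assumption, not something the table specifies.

```python
from statistics import mean


def requests_dropped(buckets_5m, ratio=0.5, baseline_buckets=12):
    """Return True when the newest 5-minute bucket is below `ratio` of the baseline.

    buckets_5m: 5-minute Requests sums, oldest first, newest last.
    baseline_buckets: 12 buckets of 5 minutes = 1 hour of history.
    """
    if len(buckets_5m) < baseline_buckets + 1:
        return False  # not enough history to form a baseline
    current = buckets_5m[-1]
    # Assumption: the baseline excludes the current bucket so that a sudden
    # drop does not pull the average down with it.
    baseline = mean(buckets_5m[-(baseline_buckets + 1):-1])
    if baseline == 0:
        return False  # no traffic in the baseline window, nothing to compare
    return current < ratio * baseline
```

With a 15m duration, the alert would fire only after this condition holds for three consecutive 5-minute evaluations.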

CloudFront metrics are always queried from the us-east-1 Region.

2. CloudWatch Metrics

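| Metric | Meaning | Why Monitor | Namespace | Statistic | Period |
|---|---|---|---|---|---|
| 5xxErrorRate | CloudFront 5xx error rate | Tells whether the CDN or origin is failing server-side | AWS/CloudFront | Average | 60s |
| TotalErrorRate | CloudFront total error rate | Shows the overall trend of failed requests, covering both 4xx and 5xx | AWS/CloudFront | Average | 60s |
| OriginLatency | Time CloudFront takes to get a response from the origin | Indicates origin performance and the health of the CDN-to-origin path | AWS/CloudFront | p95 | 60s |
| CacheHitRate | Share of requests served from the CloudFront cache | Shows whether the cache policy is effective and how much load reaches the origin | AWS/CloudFront | Average | 300s |
| Requests | Number of requests CloudFront receives | Serves as the traffic baseline, the error-rate denominator, and the input for drop detection | AWS/CloudFront | Sum | 300s |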

5xxErrorRate query. Save the following JSON as cloudfront-5xx.json:

[
  {
    "Id": "cf5xx",
    "MetricStat": {
      "Metric": {
        "Namespace": "AWS/CloudFront",
        "MetricName": "5xxErrorRate",
        "Dimensions": [
          { "Name": "DistributionId", "Value": "E1234567890" },
          { "Name": "Region", "Value": "Global" }
        ]
      },
      "Period": 60,
      "Stat": "Average"
    },
    "ReturnData": true
  }
]

Run the query with the AWS CLI:

aws cloudwatch get-metric-data \
  --region us-east-1 \
  --start-time 2026-06-02T00:00:00Z \
  --end-time 2026-06-02T01:00:00Z \
  --metric-data-queries file://cloudfront-5xx.json
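
The same query shape can be generated for all five metrics in the CloudWatch Metrics table instead of hand-writing one JSON file per metric. The following is a minimal Python sketch. The function name `build_queries`, the query IDs other than `cf5xx`, and the idea of generating the file are assumptions for illustration; the namespace, dimensions, statistics, and periods are copied from this page.

```python
import json


def build_queries(distribution_id, metrics):
    """Build a get-metric-data query list for one CloudFront distribution.

    metrics: iterable of (query_id, metric_name, stat, period_seconds).
    """
    queries = []
    for query_id, metric_name, stat, period in metrics:
        queries.append({
            "Id": query_id,
            "MetricStat": {
                "Metric": {
                    "Namespace": "AWS/CloudFront",
                    "MetricName": metric_name,
                    "Dimensions": [
                        {"Name": "DistributionId", "Value": distribution_id},
                        {"Name": "Region", "Value": "Global"},
                    ],
                },
                "Period": period,
                "Stat": stat,
            },
            "ReturnData": True,
        })
    return queries


# Statistic and period per metric, taken from the CloudWatch Metrics table.
# The query IDs after "cf5xx" are hypothetical names.
METRICS = [
    ("cf5xx", "5xxErrorRate", "Average", 60),
    ("cftotalerr", "TotalErrorRate", "Average", 60),
    ("cforiginlat", "OriginLatency", "p95", 60),
    ("cfcachehit", "CacheHitRate", "Average", 300),
    ("cfrequests", "Requests", "Sum", 300),
]

if __name__ == "__main__":
    # Example distribution ID reused from the query above.
    print(json.dumps(build_queries("E1234567890", METRICS), indent=2))
```

Redirecting the output to a file gives something that can be passed to `--metric-data-queries file://...` in the same way as cloudfront-5xx.json.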

CloudWatch Alarm conditions:

metric: 5xxErrorRate
comparison: GreaterThanOrEqualToThreshold
threshold: 5
evaluation_periods: 5
datapoints_to_alarm: 5
period: 60
severity: P1
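
To make the interaction between `evaluation_periods` and `datapoints_to_alarm` concrete, here is a simplified Python sketch of the "M out of N" check for the condition above. It is an illustration only: the function name is hypothetical, and it ignores how CloudWatch treats missing data, which the real alarm evaluation also takes into account.

```python
def alarm_state(datapoints, threshold=5.0, evaluation_periods=5, datapoints_to_alarm=5):
    """Return "ALARM", "OK", or "INSUFFICIENT_DATA" for a >= threshold condition.

    datapoints: one 5xxErrorRate value per period (60s here), oldest first.
    """
    window = datapoints[-evaluation_periods:]
    if len(window) < evaluation_periods:
        return "INSUFFICIENT_DATA"
    # Count datapoints that meet GreaterThanOrEqualToThreshold.
    breaching = sum(1 for value in window if value >= threshold)
    return "ALARM" if breaching >= datapoints_to_alarm else "OK"
```

With 5 out of 5 and a 60s period, every one of the last five minutes must be at or above 5%, which matches the 5m duration in the Alert Standard table. Lowering `datapoints_to_alarm` to 3 would tolerate two healthy minutes inside the window.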

3. PromQL

# Use a regex first to confirm the CloudFront metric names that YACE exposes.
# CloudWatch metrics for CloudFront are queried from us-east-1.
{__name__=~"aws_cloudfront_.*(error|latency|cache|requests).*"}
# CloudFront 5xx error rate is at or above 5%.
# 5xx usually comes from the origin or from CloudFront-to-origin connection problems.
aws_cloudfront_5xx_error_rate_average{distribution_id="E1234567890"} >= 5
# Total error rate is at or above 10%, covering both 4xx and 5xx.
# When 4xx dominates, distinguish real attacks or crawlers from auth, path, or cache-policy errors.
aws_cloudfront_total_error_rate_average{distribution_id="E1234567890"} >= 10
# Origin latency p95 is at or above 2 seconds.
# Confirm the unit against what the exporter actually exposes; if it is milliseconds, 2000 = 2s.
aws_cloudfront_origin_latency_p95{distribution_id="E1234567890"} >= 2000
# CacheHitRate is below 70%.
# Only applies to cacheable content; dynamic APIs should not use this threshold.
aws_cloudfront_cache_hit_rate_average{distribution_id="E1234567890"} < 70
# blackbox probe failed.
# This is a user-perspective availability check; it does not depend on CloudFront's own metrics.
probe_success{job="blackbox-http",instance="https://www.example.com"} == 0
# TLS certificate days remaining:
#   probe_ssl_earliest_cert_expiry is the Unix time at which the earliest-expiring certificate expires
#   time() is the current Unix time
#   the difference divided by 86400 gives days
# < 14 means the certificate expires within 14 days.
(probe_ssl_earliest_cert_expiry{instance="https://www.example.com"} - time()) / 86400 < 14
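
The certificate-expiry expression is plain Unix-time arithmetic. The following Python sketch mirrors it so the threshold behaviour can be checked by hand; the function names are hypothetical and the timestamps in the usage note are made-up sample values.

```python
import time

SECONDS_PER_DAY = 86400


def cert_days_remaining(expiry_unix, now_unix=None):
    """Days until the earliest certificate expiry, mirroring the PromQL expression."""
    if now_unix is None:
        now_unix = time.time()
    return (expiry_unix - now_unix) / SECONDS_PER_DAY


def cert_expiring(expiry_unix, now_unix=None, threshold_days=14):
    """True when fewer than threshold_days remain (a negative value means already expired)."""
    return cert_days_remaining(expiry_unix, now_unix) < threshold_days
```

For example, with `now_unix=1_700_000_000` and an expiry ten days later (`1_700_864_000`), `cert_days_remaining` gives 10.0 and `cert_expiring` is true. A certificate with exactly 14 days left does not trigger, because the comparison is strict.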

4. vmalert Rules

groups:
  - name: cloudfront.rules
    rules:
      - alert: CloudFront5xxErrorRateHigh
        # CloudFront 5xxErrorRate is at or above 5%.
        expr: aws_cloudfront_5xx_error_rate_average >= 5
        for: 5m
        labels:
          severity: P1
          component: cloudfront
        annotations:
          summary: "CloudFront 5xxErrorRate is >= 5%"

      - alert: CloudFrontOriginLatencyHigh
        # Origin latency p95 is at or above 2 seconds; if the exporter reports seconds, change the threshold to 2.
        expr: aws_cloudfront_origin_latency_p95 >= 2000
        for: 10m
        labels:
          severity: P1
          component: cloudfront
        annotations:
          summary: "CloudFront OriginLatency p95 is >= 2s"