AWS CloudFront Monitoring

Links

https://docs.aws.amazon.com/AmazonCloudFront/latest/DeveloperGuide/monitoring-using-cloudwatch.html
https://docs.aws.amazon.com/AmazonCloudFront/latest/DeveloperGuide/viewing-cloudfront-metrics.html
https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/using-metric-math.html
https://prometheus.io/docs/prometheus/latest/configuration/alerting_rules/

1. Alert Standard

Severity	Alert	Meaning	Why Monitor	Definition	Duration
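P1	5xx high	CloudFront is returning too high a share of 5xx responses	Directly affects user access; usually points to an origin, network, or CDN configuration problem	`5xxErrorRate >= 5%`	5m
P1	Total errors high	Combined 4xx + 5xx error rate is too high	Surfaces failed user requests and misconfigured cache rules or permissions	`TotalErrorRate >= 10%`	5m
P1	Origin latency high	Response latency between CloudFront and the origin is high	Surfaces a slow origin, an abnormal network path, or an overloaded origin	`OriginLatency p95 >= 2s`	10m
P2	Cache hit low	Cache hit rate for cacheable content is low	A low hit rate increases origin load, cost, and user-facing latency	`CacheHitRate < 70%` for cacheable content	30m
P2	Requests drop	Request volume has dropped sharply against the historical baseline	Surfaces traffic anomalies caused by DNS, certificate, routing, or frontend release problems	`current_5m < 50% of 1h average`	15m
P2	TLS expiring	Certificate is close to its expiry date	Prevents HTTPS failures caused by an expired certificate	`< 14 days` from blackbox exporter	1h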

CloudFront metrics are always queried in the us-east-1 Region.

2. CloudWatch Metrics

Metric	Meaning	Why Monitor	Namespace	Statistic	Period
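5xxErrorRate	CloudFront 5xx error rate	Shows whether the CDN or origin is failing server-side	AWS/CloudFront	Average	60s
TotalErrorRate	CloudFront total error rate	Shows the overall trend of failed requests, covering both 4xx and 5xx	AWS/CloudFront	Average	60s
OriginLatency	Latency for CloudFront to get a response from the origin	Shows origin performance and the health of the CloudFront-to-origin path	AWS/CloudFront	p95	60s
CacheHitRate	Share of requests served from the CloudFront cache	Shows whether the cache policy is effective and how much load reaches the origin	AWS/CloudFront	Average	300s
Requests	Number of requests received by CloudFront	Serves as the traffic baseline, the error-rate denominator, and the input for drop detection	AWS/CloudFront	Sum	300s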
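
One way to get the metrics above into Prometheus is a YACE discovery job. The fragment below is a minimal sketch, not a verified configuration: it assumes YACE's `v1alpha1` config schema and assumes the distribution is discoverable through the Resource Groups Tagging API (in practice, that it carries at least one tag). Check the field names against the YACE version in use. Note also that CacheHitRate and OriginLatency belong to CloudFront's additional metrics, which have to be turned on for the distribution before CloudWatch reports them.

```yaml
# Sketch of a YACE config for the CloudFront metrics in the table above.
# Assumption: YACE v1alpha1 schema; verify field names for your version.
apiVersion: v1alpha1
discovery:
  jobs:
    - type: AWS/CloudFront
      regions:
        - us-east-1            # CloudFront metrics are published in us-east-1
      metrics:
        - name: 5xxErrorRate
          statistics: [Average]
          period: 60
          length: 300          # look-back window in seconds (assumed value)
        - name: TotalErrorRate
          statistics: [Average]
          period: 60
          length: 300
        - name: OriginLatency  # additional metric, must be enabled first
          statistics: [p95]
          period: 60
          length: 300
        - name: CacheHitRate   # additional metric, must be enabled first
          statistics: [Average]
          period: 300
          length: 600
        - name: Requests
          statistics: [Sum]
          period: 300
          length: 600
```

With this naming scheme the statistic becomes the metric suffix, which is where names such as `aws_cloudfront_5xx_error_rate_average` and `aws_cloudfront_origin_latency_p95` in the PromQL section come from.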

5xxErrorRate query (saved as cloudfront-5xx.json for the command that follows):

[
  {
    "Id": "cf5xx",
    "MetricStat": {
      "Metric": {
        "Namespace": "AWS/CloudFront",
        "MetricName": "5xxErrorRate",
        "Dimensions": [
          { "Name": "DistributionId", "Value": "E1234567890" },
          { "Name": "Region", "Value": "Global" }
        ]
      },
      "Period": 60,
      "Stat": "Average"
    },
    "ReturnData": true
  }
]

aws cloudwatch get-metric-data \
  --region us-east-1 \
  --start-time 2026-06-02T00:00:00Z \
  --end-time 2026-06-02T01:00:00Z \
  --metric-data-queries file://cloudfront-5xx.json

CloudWatch Alarm condition:

metric: 5xxErrorRate
comparison: GreaterThanOrEqualToThreshold
threshold: 5
evaluation_periods: 5
datapoints_to_alarm: 5
period: 60
severity: P1
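
The condition above maps fairly directly onto a CloudWatch alarm resource. The following is a hedged CloudFormation sketch rather than the team's actual template: the logical ID, the alarm name, and the `TreatMissingData` choice are assumptions. Severity is not an alarm property, so it would have to be carried by the alarm name or by notification routing. The stack needs to be deployed in us-east-1, since that is where the CloudFront metrics live.

```yaml
# Hypothetical CloudFormation sketch of the P1 5xxErrorRate alarm.
AWSTemplateFormatVersion: "2010-09-09"
Resources:
  CloudFront5xxErrorRateHigh:                      # hypothetical logical ID
    Type: AWS::CloudWatch::Alarm
    Properties:
      AlarmName: p1-cloudfront-5xx-error-rate-high # hypothetical name
      Namespace: AWS/CloudFront
      MetricName: 5xxErrorRate
      Dimensions:
        - Name: DistributionId
          Value: E1234567890
        - Name: Region
          Value: Global
      Statistic: Average
      Period: 60                 # seconds per datapoint
      EvaluationPeriods: 5       # 5 x 60s = the 5m duration in the alert table
      DatapointsToAlarm: 5       # all 5 datapoints must breach
      Threshold: 5               # percent
      ComparisonOperator: GreaterThanOrEqualToThreshold
      TreatMissingData: notBreaching   # assumption: missing data means no traffic
```

For the OriginLatency alarm the same shape should apply with `ExtendedStatistic: p95` in place of `Statistic`, because percentiles are not plain statistics in CloudWatch alarms.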

3. PromQL

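# First use a regex to confirm the CloudFront metric names that YACE exposes.
# CloudFront's CloudWatch metrics are queried in us-east-1.
# Also confirm which label carries the distribution ID: YACE typically exposes
# dimensions as dimension_<Name> labels, so distribution_id below may need adjusting.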
{__name__=~"aws_cloudfront_.*(error|latency|cache|requests).*"}

# CloudFront 5xx error rate is at or above 5%.
# 5xx errors usually come from the origin or from connection problems between CloudFront and the origin.
aws_cloudfront_5xx_error_rate_average{distribution_id="E1234567890"} >= 5

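# Total error rate is at or above 10%, covering both 4xx and 5xx.
# When 4xx dominates, distinguish real attacks or crawlers from authentication, path, or cache policy mistakes.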
aws_cloudfront_total_error_rate_average{distribution_id="E1234567890"} >= 10

# Origin latency p95 is at or above 2 seconds.
# Confirm the unit against what the exporter actually exposes; if it is milliseconds, 2000 = 2s.
aws_cloudfront_origin_latency_p95{distribution_id="E1234567890"} >= 2000

# CacheHitRate is below 70%.
# Applies only to cacheable content; do not apply this threshold to dynamic APIs.
aws_cloudfront_cache_hit_rate_average{distribution_id="E1234567890"} < 70

# Blackbox probe failed.
# This is an availability check from the user's point of view and does not depend on CloudFront's own metrics.
probe_success{job="blackbox-http",instance="https://www.example.com"} == 0

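# Days remaining on the TLS certificate:
#   probe_ssl_earliest_cert_expiry is the Unix time at which the earliest-expiring certificate expires
#   time() is the current Unix time
#   subtract, then divide by 86400 to convert seconds to days
# < 14 means the certificate expires within 14 days.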
(probe_ssl_earliest_cert_expiry{instance="https://www.example.com"} - time()) / 86400 < 14
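
The two blackbox queries above depend on a scrape job that sets `job="blackbox-http"` and puts the probed URL into the `instance` label. A minimal sketch of such a job follows, using the relabeling pattern commonly shown for blackbox exporter. The exporter address `blackbox-exporter:9115` and the `http_2xx` module name are assumptions and should be replaced with the values from the actual deployment.

```yaml
# Sketch of a Prometheus-compatible scrape job for blackbox exporter.
scrape_configs:
  - job_name: blackbox-http
    metrics_path: /probe
    params:
      module: [http_2xx]                  # assumed module name
    static_configs:
      - targets:
          - https://www.example.com       # URL to probe
    relabel_configs:
      # Pass the target URL to the exporter as the ?target= parameter.
      - source_labels: [__address__]
        target_label: __param_target
      # Keep the probed URL as the instance label used in the queries above.
      - source_labels: [__param_target]
        target_label: instance
      # Send the actual scrape to the blackbox exporter itself.
      - target_label: __address__
        replacement: blackbox-exporter:9115   # assumed exporter address
```

Because the certificate expiry metric comes from the same HTTPS probe, one job covers both the availability check and the TLS expiry check.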

4. vmalert Rules

groups:
  - name: cloudfront.rules
    rules:
      - alert: CloudFront5xxErrorRateHigh
        # CloudFront 5xxErrorRate is at or above 5%.
        expr: aws_cloudfront_5xx_error_rate_average >= 5
        for: 5m
        labels:
          severity: P1
          component: cloudfront
        annotations:
          summary: "CloudFront 5xxErrorRate is >= 5%"

      - alert: CloudFrontOriginLatencyHigh
        # Origin latency p95 is at or above 2 seconds; if the exporter reports seconds, change the threshold to 2.
        expr: aws_cloudfront_origin_latency_p95 >= 2000
        for: 10m
        labels:
          severity: P1
          component: cloudfront
        annotations:
          summary: "CloudFront OriginLatency p95 is >= 2s"
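
The rules above cover two of the six alerts in the Alert Standard table. The remaining four could be sketched along the same lines, as below. This is an illustration under assumptions, not the deployed rule set: the group name is hypothetical, `aws_cloudfront_requests_sum` is assumed from the exporter's naming pattern and should be confirmed with the regex query in the PromQL section, and the requests-drop expression is one possible reading of `current_5m < 50% of 1h average`.

```yaml
groups:
  - name: cloudfront.extra.rules          # hypothetical group name
    rules:
      - alert: CloudFrontTotalErrorRateHigh
        # TotalErrorRate (4xx + 5xx) is at or above 10%.
        expr: aws_cloudfront_total_error_rate_average >= 10
        for: 5m
        labels:
          severity: P1
          component: cloudfront
        annotations:
          summary: "CloudFront TotalErrorRate is >= 10%"

      - alert: CloudFrontCacheHitRateLow
        # Only meaningful for distributions serving cacheable content.
        expr: aws_cloudfront_cache_hit_rate_average < 70
        for: 30m
        labels:
          severity: P2
          component: cloudfront
        annotations:
          summary: "CloudFront CacheHitRate is < 70%"

      - alert: CloudFrontRequestsDrop
        # Latest 5m request count is below half of its 1h average.
        # Assumed metric name; Requests is scraped as a 300s Sum.
        expr: aws_cloudfront_requests_sum < 0.5 * avg_over_time(aws_cloudfront_requests_sum[1h])
        for: 15m
        labels:
          severity: P2
          component: cloudfront
        annotations:
          summary: "CloudFront requests are < 50% of the 1h average"

      - alert: CloudFrontTLSCertExpiring
        # Certificate expires in fewer than 14 days, per blackbox exporter.
        expr: (probe_ssl_earliest_cert_expiry - time()) / 86400 < 14
        for: 1h
        labels:
          severity: P2
          component: cloudfront
        annotations:
          summary: "TLS certificate expires in < 14 days"
```

A low-traffic distribution can trip the requests-drop rule during normal quiet hours, so a minimum-traffic guard or a same-time-yesterday comparison may be worth adding before relying on it.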