Links
https://docs.aws.amazon.com/AmazonCloudFront/latest/DeveloperGuide/monitoring-using-cloudwatch.html
https://docs.aws.amazon.com/AmazonCloudFront/latest/DeveloperGuide/viewing-cloudfront-metrics.html
https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/using-metric-math.html
https://prometheus.io/docs/prometheus/latest/configuration/alerting_rules/

1. Alert Standard
| Severity | Alert | Meaning | Why Monitor | Definition | Duration |
|---|---|---|---|---|---|
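| P1 | 5xx high | CloudFront is returning too high a share of 5xx responses | Directly affects user access; usually points to an origin, network, or CDN configuration problem | 5xxErrorRate >= 5% | 5m |
| P1 | Total errors high | Combined 4xx + 5xx error rate is too high | Catches failed user requests and broken cache rules or permission settings | TotalErrorRate >= 10% | 5m |
| P1 | Origin latency high | Responses from the origin to CloudFront are slow | Catches a slow origin, an abnormal network path, or origin overload | OriginLatency p95 >= 2s | 10m |
| P2 | Cache hit low | Cache hit rate for cacheable content is low | A low hit rate raises origin load, cost, and user latency | CacheHitRate < 70% for cacheable content | 30m |
| P2 | Requests drop | Request volume falls well below its historical baseline | Catches traffic anomalies caused by DNS, certificate, routing, or frontend release problems | current_5m < 50% of 1h average | 15m |
| P2 | TLS expiring | Certificate is close to its expiry date | Prevents HTTPS failures caused by an expired certificate | < 14 days from blackbox exporter | 1h |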
CloudFront metrics are always queried in the us-east-1 Region.
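
The "Requests drop" definition in the table (current_5m < 50% of 1h average) is the only rule there that compares against a moving baseline instead of a fixed threshold. A minimal Python sketch of that comparison follows. The function name `requests_dropped` and the choice of twelve 5-minute sums as the one-hour window are illustrative assumptions, not part of any existing tooling.

```python
from statistics import mean


def requests_dropped(window_sums, current_sum, ratio=0.5):
    """Return True when current_sum falls below ratio * mean(window_sums).

    window_sums: 5-minute request sums covering the previous hour (assumed: 12 values).
    current_sum: request sum for the latest 5-minute bucket.
    ratio: fraction of the baseline that counts as a drop (0.5 = 50%).
    """
    if not window_sums:
        return False  # no baseline yet, so there is nothing to compare against
    baseline = mean(window_sums)
    return current_sum < ratio * baseline


history = [1000] * 12  # hypothetical hour of steady traffic
print(requests_dropped(history, 400))  # True: 400 is below half of 1000
print(requests_dropped(history, 600))  # False: 600 is above half of 1000
```

The 15m duration in the table then requires this condition to hold across consecutive evaluations, which keeps a single quiet bucket from paging anyone.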
2. CloudWatch Metrics
| Metric | Meaning | Why Monitor | Namespace | Statistic | Period |
|---|---|---|---|---|---|
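| 5xxErrorRate | CloudFront 5xx error rate | Shows whether the CDN or origin is failing server-side | AWS/CloudFront | Average | 60s |
| TotalErrorRate | CloudFront total error rate | Tracks the overall trend of failed requests, both 4xx and 5xx | AWS/CloudFront | Average | 60s |
| OriginLatency | Latency of CloudFront fetching a response from the origin | Shows origin performance and the health of the CDN-to-origin path | AWS/CloudFront | p95 | 60s |
| CacheHitRate | Share of requests served from the CloudFront cache | Shows whether the caching policy is effective and how much load reaches the origin | AWS/CloudFront | Average | 300s |
| Requests | Number of requests CloudFront receives | Serves as the traffic baseline, the error-rate denominator, and the input for drop detection | AWS/CloudFront | Sum | 300s |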
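
Each row of the table maps onto one entry in a `get-metric-data` query file. As a sketch, the Python below generates those entries for all five metrics. The helper name `build_queries` and the `m0`..`m4` ids are illustrative assumptions; the dimensions and the distribution id `E1234567890` are the same placeholders used in the 5xxErrorRate query. OriginLatency and CacheHitRate belong to CloudFront's additional metrics, which have to be enabled on the distribution before they return data.

```python
import json

# (MetricName, Stat, Period in seconds), copied from the table above.
METRICS = [
    ("5xxErrorRate", "Average", 60),
    ("TotalErrorRate", "Average", 60),
    ("OriginLatency", "p95", 60),
    ("CacheHitRate", "Average", 300),
    ("Requests", "Sum", 300),
]


def build_queries(distribution_id):
    """Build one MetricDataQuery dict per table row for the given distribution."""
    queries = []
    for index, (name, stat, period) in enumerate(METRICS):
        queries.append({
            "Id": f"m{index}",  # ids must start with a lowercase letter
            "MetricStat": {
                "Metric": {
                    "Namespace": "AWS/CloudFront",
                    "MetricName": name,
                    "Dimensions": [
                        {"Name": "DistributionId", "Value": distribution_id},
                        {"Name": "Region", "Value": "Global"},
                    ],
                },
                "Period": period,
                "Stat": stat,
            },
            "ReturnData": True,
        })
    return queries


if __name__ == "__main__":
    # Print JSON that can be saved and passed via --metric-data-queries file://...
    print(json.dumps(build_queries("E1234567890"), indent=2))
```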
5xxErrorRate query (saved as cloudfront-5xx.json, the file the CLI call reads):
[
  {
    "Id": "cf5xx",
    "MetricStat": {
      "Metric": {
        "Namespace": "AWS/CloudFront",
        "MetricName": "5xxErrorRate",
        "Dimensions": [
          { "Name": "DistributionId", "Value": "E1234567890" },
          { "Name": "Region", "Value": "Global" }
        ]
      },
      "Period": 60,
      "Stat": "Average"
    },
    "ReturnData": true
  }
]

aws cloudwatch get-metric-data \
  --region us-east-1 \
  --start-time 2026-06-02T00:00:00Z \
  --end-time 2026-06-02T01:00:00Z \
  --metric-data-queries file://cloudfront-5xx.json

CloudWatch Alarm conditions:
metric: 5xxErrorRate
comparison: GreaterThanOrEqualToThreshold
threshold: 5
evaluation_periods: 5
datapoints_to_alarm: 5
period: 60
severity: P1

3. PromQL
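# Use a regex first to confirm the CloudFront metric names that YACE exposes.
# CloudFront's CloudWatch metrics are published in us-east-1, so the exporter has to query that Region.
{__name__=~"aws_cloudfront_.*(error|latency|cache|requests).*"}

# CloudFront 5xx error rate is at or above 5%.
# 5xx responses usually come from the origin or from connection problems between CloudFront and the origin.
aws_cloudfront_5xx_error_rate_average{distribution_id="E1234567890"} >= 5

# Total error rate is at or above 10%, counting both 4xx and 5xx.
# When 4xx dominates, separate real attacks or crawlers from authentication, path, or cache-policy mistakes.
aws_cloudfront_total_error_rate_average{distribution_id="E1234567890"} >= 10

# Origin latency p95 is at or above 2 seconds.
# Confirm the unit against what the exporter currently exposes; in milliseconds, 2000 = 2s.
aws_cloudfront_origin_latency_p95{distribution_id="E1234567890"} >= 2000

# CacheHitRate is below 70%.
# Applies only to cacheable content; dynamic APIs should not use this threshold.
aws_cloudfront_cache_hit_rate_average{distribution_id="E1234567890"} < 70

# Blackbox probe failed.
# This is a user-side availability check that does not depend on CloudFront's own metrics.
probe_success{job="blackbox-http",instance="https://www.example.com"} == 0

# Days left on the TLS certificate:
# probe_ssl_earliest_cert_expiry is the Unix time at which the earliest-expiring certificate expires
# time() is the current Unix time
# subtracting the two and dividing by 86400 converts the difference to days
# < 14 means the certificate expires within 14 days.
(probe_ssl_earliest_cert_expiry{instance="https://www.example.com"} - time()) / 86400 < 14

4. vmalert Rules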
groups:
  - name: cloudfront.rules
    rules:
      - alert: CloudFront5xxErrorRateHigh
        # CloudFront 5xxErrorRate is at or above 5%.
        expr: aws_cloudfront_5xx_error_rate_average >= 5
        for: 5m
        labels:
          severity: P1
          component: cloudfront
        annotations:
          summary: "CloudFront 5xxErrorRate is >= 5%"
      - alert: CloudFrontOriginLatencyHigh
        # Origin latency p95 is at or above 2 seconds; if the exporter reports seconds, change the threshold to 2.
        expr: aws_cloudfront_origin_latency_p95 >= 2000
        for: 10m
        labels:
          severity: P1
          component: cloudfront
        annotations:
          summary: "CloudFront OriginLatency p95 is >= 2s"
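
The `for` field in these rules means the expression has to stay true for the whole duration before the alert fires, and a single recovered sample resets the timer. The Python below is a simplified simulation of that behaviour, not vmalert's actual implementation. The function name `first_firing_time` and the 60-second evaluation interval are assumptions; the real interval comes from the vmalert group or global configuration.

```python
def first_firing_time(samples, threshold, for_seconds, interval_seconds=60):
    """Return the time offset (seconds) at which the alert would fire, or None.

    samples: metric values, one per evaluation, oldest first.
    threshold: alert when value >= threshold.
    for_seconds: how long the condition must hold continuously (the rule's `for`).
    interval_seconds: assumed spacing between evaluations.
    """
    active_since = None  # time at which the condition most recently became true
    for index, value in enumerate(samples):
        now = index * interval_seconds
        if value >= threshold:
            if active_since is None:
                active_since = now  # alert enters the pending state
            if now - active_since >= for_seconds:
                return now  # pending long enough, alert fires
        else:
            active_since = None  # condition cleared, pending state resets
    return None


# 5xx rate stays at 6% for six evaluations: fires once 5 minutes have elapsed.
print(first_firing_time([6, 6, 6, 6, 6, 6], threshold=5, for_seconds=300))  # 300
# One healthy sample in the middle resets the timer, so nothing fires here.
print(first_firing_time([6, 6, 6, 4, 6, 6], threshold=5, for_seconds=300))  # None
```

A longer `for` trades detection speed for fewer false pages, which is why the latency rule waits 10m while the 5xx rule waits only 5m.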