Links
https://docs.aws.amazon.com/elasticloadbalancing/latest/application/load-balancer-cloudwatch-metrics.html
https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/using-metric-math.html
https://prometheus.io/docs/prometheus/latest/configuration/alerting_rules/

1. Alert Standard
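| Severity | Alert | Meaning | Why Monitor | Definition | Duration |
|---|---|---|---|---|---|
| P0 | No healthy target | The target group has no healthy backends | The ALB cannot forward to any available instance; this usually means the service is down | HealthyHostCount == 0 | 2m |
| P1 | Target 5xx rate high | Too large a share of target responses are 5xx | Directly reflects service errors caused by the application, its dependencies, or a release | Target5xxRate >= 5% | 5m |
| P1 | ALB 5xx exists | The ALB itself generates 5xx responses | Commonly caused by no healthy targets, connection failures, or protocol problems | HTTPCode_ELB_5XX_Count >= 10 in 5m | 5m |
| P1 | Target p95 latency high | The p95 of backend response time has risen | Detects slow user requests, slow dependencies, or under-resourced targets | TargetResponseTime p95 >= 1s | 10m |
| P2 | Rejected connections | The ALB rejects new connections | A connection/LCU limit may have been reached, or there is a sudden load spike | RejectedConnectionCount > 0 | 5m |
| P2 | Target connection errors | The ALB fails to connect to targets | Detects security group, port, crashed target, or network path problems | TargetConnectionErrorCount > 0 | 5m |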
Abnormal drop in request volume:
current_request_rate_5m < 0.5 * avg_over_time(request_rate[1h])
for 15m

2. CloudWatch Metrics
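| Metric | Meaning | Why Monitor | Namespace | Statistic | Unit |
|---|---|---|---|---|---|
| HealthyHostCount | Number of healthy targets | Tells whether the ALB still has a backend to forward to | AWS/ApplicationELB | Minimum | Count |
| RequestCount | Number of requests received by the ALB | Serves as the traffic baseline and the error-rate denominator | AWS/ApplicationELB | Sum | Count |
| HTTPCode_Target_5XX_Count | Number of 5xx responses returned by targets | Points to errors in the application or its backend dependencies | AWS/ApplicationELB | Sum | Count |
| HTTPCode_ELB_5XX_Count | Number of 5xx responses generated by the ALB | Points to problems on the load-balancer-to-backend path or in configuration | AWS/ApplicationELB | Sum | Count |
| TargetResponseTime | Time from the request leaving the load balancer until the target starts to respond | Measures backend latency and the risk to user experience | AWS/ApplicationELB | p95 | Seconds |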
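If these metrics are scraped into Prometheus through YACE, as the PromQL section assumes, the statistics in the table could map onto a discovery job roughly like the one below. This is a sketch under assumptions, not a tested configuration: the region is a placeholder, the 300 s period is chosen to match the metric math query in this section, and the key names (`apiVersion`, `discovery.jobs`, `nilToZero`) should be checked against the YACE version in use.

```yaml
# Hypothetical YACE discovery job for the ALB metrics listed in the table.
apiVersion: v1alpha1
discovery:
  jobs:
    - type: AWS/ApplicationELB
      regions:
        - us-east-1          # assumption: replace with the ALB's region
      period: 300            # assumption: same 300 s period as the metric math query
      length: 300
      metrics:
        - name: HealthyHostCount
          statistics: [Minimum]
          # nilToZero is deliberately not set here: turning missing data
          # into 0 would look like "no healthy target".
        - name: RequestCount
          statistics: [Sum]
          nilToZero: true    # report 0 instead of NaN when CloudWatch returns no datapoint
        - name: HTTPCode_Target_5XX_Count
          statistics: [Sum]
          nilToZero: true
        - name: HTTPCode_ELB_5XX_Count
          statistics: [Sum]
          nilToZero: true
        - name: TargetResponseTime
          statistics: [p95]
```

CloudWatch generally publishes the count metrics only when they are non-zero, so without something like `nilToZero` the 5xx series can be absent during healthy periods and ratio expressions built on them return no result.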
Target 5xx rate metric math:
[
  {
    "Id": "req",
    "MetricStat": {
      "Metric": {
        "Namespace": "AWS/ApplicationELB",
        "MetricName": "RequestCount",
        "Dimensions": [
          { "Name": "LoadBalancer", "Value": "app/prod-api/abc123" },
          { "Name": "TargetGroup", "Value": "targetgroup/prod-api/def456" }
        ]
      },
      "Period": 300,
      "Stat": "Sum"
    },
    "ReturnData": false
  },
  {
    "Id": "e5xx",
    "MetricStat": {
      "Metric": {
        "Namespace": "AWS/ApplicationELB",
        "MetricName": "HTTPCode_Target_5XX_Count",
        "Dimensions": [
          { "Name": "LoadBalancer", "Value": "app/prod-api/abc123" },
          { "Name": "TargetGroup", "Value": "targetgroup/prod-api/def456" }
        ]
      },
      "Period": 300,
      "Stat": "Sum"
    },
    "ReturnData": false
  },
  {
    "Id": "target_5xx_rate",
    "Expression": "IF(req>0,100*e5xx/req,0)",
    "Label": "Target 5xx percent",
    "ReturnData": true
  }
]

CloudWatch Alarm conditions:
metric math id: target_5xx_rate
comparison: GreaterThanOrEqualToThreshold
threshold: 5
evaluation_periods: 1
datapoints_to_alarm: 1
period: 300
severity: P1

3. PromQL
First, confirm the YACE metric names:
# Use a regex to check which ALB metric names YACE exposes.
# Depending on the exporter configuration, the labels may be named load_balancer, load_balancer_name, or target_group.
{__name__=~"aws_applicationelb_.*"}

# A healthy target count of 0 means the ALB has no backend left to forward to.
# This is an availability alert and should normally be P0.
aws_applicationelb_healthy_host_count_minimum{load_balancer="app/prod-api/abc123",target_group="targetgroup/prod-api/def456"} == 0

# Compute the ALB target 5xx error rate:
#   numerator   = number of 5xx responses returned by targets
#   denominator = total number of ALB requests
# clamp_min(..., 1) guards against division by zero at low or zero traffic.
# A result >= 5 means the target 5xx rate has reached 5%.
100 *
  aws_applicationelb_httpcode_target_5xx_count_sum{load_balancer="app/prod-api/abc123"}
/
  clamp_min(aws_applicationelb_request_count_sum{load_balancer="app/prod-api/abc123"}, 1)
>= 5

# Target response time p95 has reached or exceeded 1 second.
# This measures slow backend responses; it does not cover the full user network path.
aws_applicationelb_target_response_time_p95{load_balancer="app/prod-api/abc123"} >= 1

4. vmalert Rules
groups:
  - name: alb.rules
    rules:
      - alert: ALBNoHealthyTargets
        # With no healthy target, the ALB cannot forward requests to the backend.
        expr: aws_applicationelb_healthy_host_count_minimum == 0
        for: 2m
        labels:
          severity: P0
          component: alb
        annotations:
          summary: "ALB has no healthy target"
      - alert: ALBTarget5xxRateHigh
        # target 5xx rate = target 5xx count / request count * 100.
        # clamp_min keeps the expression well-defined when RequestCount is 0.
        expr: |
          100 * aws_applicationelb_httpcode_target_5xx_count_sum
          /
          clamp_min(aws_applicationelb_request_count_sum, 1) >= 5
        for: 5m
        labels:
          severity: P1
          component: alb
        annotations:
          summary: "ALB target 5xx rate is >= 5%"
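
The remaining rows of the alert table, plus the request-drop condition, could be expressed in the same style. The group below is a sketch, not a verified rule set. Apart from `aws_applicationelb_target_response_time_p95` and `aws_applicationelb_request_count_sum`, which appear earlier in this document, the metric names are guesses that follow the same naming pattern. The alert names and summaries are hypothetical, and the thresholds assume each sample covers a 5-minute CloudWatch period. Confirm the real metric names with the `{__name__=~"aws_applicationelb_.*"}` query before using any of it.

```yaml
groups:
  - name: alb.extra.rules   # hypothetical group name
    rules:
      - alert: ALBElb5xxExists
        # Assumed metric name; assumes each sample is a 5-minute Sum.
        expr: aws_applicationelb_httpcode_elb_5xx_count_sum >= 10
        for: 5m
        labels:
          severity: P1
          component: alb
        annotations:
          summary: "ALB generated >= 10 5xx responses in a 5m period"
      - alert: ALBTargetP95LatencyHigh
        expr: aws_applicationelb_target_response_time_p95 >= 1
        for: 10m
        labels:
          severity: P1
          component: alb
        annotations:
          summary: "ALB target p95 response time is >= 1s"
      - alert: ALBRejectedConnections
        # Assumed metric name for RejectedConnectionCount.
        expr: aws_applicationelb_rejected_connection_count_sum > 0
        for: 5m
        labels:
          severity: P2
          component: alb
        annotations:
          summary: "ALB is rejecting new connections"
      - alert: ALBTargetConnectionErrors
        # Assumed metric name for TargetConnectionErrorCount.
        expr: aws_applicationelb_target_connection_error_count_sum > 0
        for: 5m
        labels:
          severity: P2
          component: alb
        annotations:
          summary: "ALB cannot connect to one or more targets"
      - alert: ALBRequestRateDrop
        # Current 5-minute request count below half of its 1-hour average.
        # The alert table gives no severity for this condition, so none is set.
        expr: |
          aws_applicationelb_request_count_sum
          < 0.5 * avg_over_time(aws_applicationelb_request_count_sum[1h])
        for: 15m
        labels:
          component: alb
        annotations:
          summary: "ALB request volume dropped below 50% of the 1h average"
```

The `> 0` rules only fire if the exporter actually emits a series for those counters, and the request-drop rule is noisy for services with naturally bursty or near-zero traffic, so a minimum-traffic guard may be worth adding.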