AWS ALB Monitoring


https://docs.aws.amazon.com/elasticloadbalancing/latest/application/load-balancer-cloudwatch-metrics.html
https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/using-metric-math.html
https://prometheus.io/docs/prometheus/latest/configuration/alerting_rules/

1. Alert Standard

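| Severity | Alert | Meaning | Why Monitor | Definition | Duration |
|---|---|---|---|---|---|
| P0 | No healthy target | The target group has no healthy backends | The ALB cannot forward to any available instance; this usually means the service is unavailable | HealthyHostCount == 0 | 2m |
| P1 | Target 5xx rate high | Too large a share of target responses are 5xx | Directly reflects service errors caused by the application, its dependencies, or a release | Target5xxRate >= 5% | 5m |
| P1 | ALB 5xx exists | The ALB itself generates 5xx responses | Commonly caused by no healthy targets, connection failures, or protocol problems | HTTPCode_ELB_5XX_Count >= 10 in 5m | 5m |
| P1 | Target p95 latency high | The p95 backend response time rises | Detects slow user requests, slow dependencies, or under-resourced targets | TargetResponseTime p95 >= 1s | 10m |
| P2 | Rejected connections | The ALB rejects new connections | The ALB may have hit connection/LCU limits or a momentary traffic spike | RejectedConnectionCount > 0 | 5m |
| P2 | Target connection errors | The ALB fails to connect to targets | Detects security group, port, crashed-target, or network path problems | TargetConnectionErrorCount > 0 | 5m |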

Abnormal drop in request volume:

current_request_rate_5m < 0.5 * avg_over_time(request_rate[1h])
for 15m
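
The comparison above can be sketched in Python to sanity-check the 0.5 factor against sample data. This is only an illustration: the function name `is_traffic_drop` and the sample values are hypothetical, and the `for 15m` hold time would normally be enforced by the alerting engine rather than by this check.

```python
from statistics import mean


def is_traffic_drop(current_rate, baseline_rates, ratio=0.5):
    """Return True when current_rate is below ratio * mean(baseline_rates).

    baseline_rates stands in for the request_rate samples of the last hour.
    An empty baseline returns False because there is nothing to compare against.
    """
    if not baseline_rates:
        return False
    return current_rate < ratio * mean(baseline_rates)


# Hypothetical samples: the 1h baseline averages 100 requests per second.
baseline = [90.0, 110.0, 100.0, 100.0]
print(is_traffic_drop(40.0, baseline))  # True: 40 is below 0.5 * 100
print(is_traffic_drop(60.0, baseline))  # False: 60 is not below 50
```

Returning False on an empty baseline is a design choice for the sketch; a missing baseline is often better handled as a separate "no data" alert.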

2. CloudWatch Metrics

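| Metric | Meaning | Why Monitor | Namespace | Statistic | Unit |
|---|---|---|---|---|---|
| HealthyHostCount | Number of healthy targets | Tells whether the ALB still has a backend to forward to | AWS/ApplicationELB | Minimum | Count |
| RequestCount | Number of requests received by the ALB | Serves as the traffic baseline and the error-rate denominator | AWS/ApplicationELB | Sum | Count |
| HTTPCode_Target_5XX_Count | Number of 5xx responses returned by targets | Pinpoints application or backend dependency errors | AWS/ApplicationELB | Sum | Count |
| HTTPCode_ELB_5XX_Count | Number of 5xx responses generated by the ALB | Pinpoints problems on the load-balancer-to-backend path or in configuration | AWS/ApplicationELB | Sum | Count |
| TargetResponseTime | Time from the request leaving the load balancer until the target starts sending a response | Measures backend latency and the risk to user experience | AWS/ApplicationELB | p95 | Seconds |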

Target 5xx rate metric math:

[
  {
    "Id": "req",
    "MetricStat": {
      "Metric": {
        "Namespace": "AWS/ApplicationELB",
        "MetricName": "RequestCount",
        "Dimensions": [
          { "Name": "LoadBalancer", "Value": "app/prod-api/abc123" },
          { "Name": "TargetGroup", "Value": "targetgroup/prod-api/def456" }
        ]
      },
      "Period": 300,
      "Stat": "Sum"
    },
    "ReturnData": false
  },
  {
    "Id": "e5xx",
    "MetricStat": {
      "Metric": {
        "Namespace": "AWS/ApplicationELB",
        "MetricName": "HTTPCode_Target_5XX_Count",
        "Dimensions": [
          { "Name": "LoadBalancer", "Value": "app/prod-api/abc123" },
          { "Name": "TargetGroup", "Value": "targetgroup/prod-api/def456" }
        ]
      },
      "Period": 300,
      "Stat": "Sum"
    },
    "ReturnData": false
  },
  {
    "Id": "target_5xx_rate",
    "Expression": "IF(req>0,100*e5xx/req,0)",
    "Label": "Target 5xx percent",
    "ReturnData": true
  }
]

CloudWatch alarm condition:

metric math id: target_5xx_rate
comparison: GreaterThanOrEqualToThreshold
threshold: 5
evaluation_periods: 1
datapoints_to_alarm: 1
period: 300
severity: P1
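
To check the threshold against real numbers, the metric math expression and the alarm comparison can be mirrored in a few lines of Python. This is a minimal sketch, not the CloudWatch implementation: the function names `target_5xx_percent` and `breaches` are hypothetical, and the inputs stand for the 300-second sums of RequestCount and HTTPCode_Target_5XX_Count.

```python
def target_5xx_percent(req, e5xx):
    """Mirror the metric math expression IF(req>0,100*e5xx/req,0)."""
    return 100 * e5xx / req if req > 0 else 0


def breaches(percent, threshold=5):
    """Mirror GreaterThanOrEqualToThreshold with threshold 5."""
    return percent >= threshold


# Hypothetical 5-minute sums.
print(target_5xx_percent(2000, 100), breaches(target_5xx_percent(2000, 100)))  # 5.0 True
print(target_5xx_percent(2000, 99), breaches(target_5xx_percent(2000, 99)))    # 4.95 False
print(target_5xx_percent(0, 0), breaches(target_5xx_percent(0, 0)))            # 0 False
```

With evaluation_periods and datapoints_to_alarm both set to 1, a single breaching 300-second datapoint is enough to trigger the alarm, which matches the 5m duration in the alert table.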

3. PromQL

First confirm the YACE metrics:

# Use a regex to confirm which ALB metric names YACE exposes.
# Depending on the exporter configuration, labels may be named load_balancer, load_balancer_name, or target_group.
{__name__=~"aws_applicationelb_.*"}

# A healthy target count of 0 means the ALB has no backend left to forward to.
# This is an availability alert and should usually be P0.
aws_applicationelb_healthy_host_count_minimum{load_balancer="app/prod-api/abc123",target_group="targetgroup/prod-api/def456"} == 0

# Compute the ALB target 5xx error rate:
#   numerator   = number of 5xx responses returned by targets
#   denominator = total ALB request count
#   clamp_min(..., 1) guards against division by zero at low traffic or zero requests
# A result >= 5 means the target 5xx rate has reached 5%.
100 *
aws_applicationelb_httpcode_target_5xx_count_sum{load_balancer="app/prod-api/abc123"}
/
clamp_min(aws_applicationelb_request_count_sum{load_balancer="app/prod-api/abc123"}, 1)
>= 5

# Target response time p95 is at or above 1 second.
# This measures slow backend responses and does not cover the full user network path.
aws_applicationelb_target_response_time_p95{load_balancer="app/prod-api/abc123"} >= 1

4. vmalert Rules

groups:
  - name: alb.rules
    rules:
      - alert: ALBNoHealthyTargets
        # With no healthy target, the ALB cannot forward requests to the backend.
        expr: aws_applicationelb_healthy_host_count_minimum == 0
        for: 2m
        labels:
          severity: P0
          component: alb
        annotations:
          summary: "ALB has no healthy target"

      - alert: ALBTarget5xxRateHigh
        # target 5xx rate = target 5xx count / request count * 100.
        # clamp_min keeps the expression well-defined when RequestCount is 0.
        expr: |
          100 * aws_applicationelb_httpcode_target_5xx_count_sum
          /
          clamp_min(aws_applicationelb_request_count_sum, 1) >= 5
        for: 5m
        labels:
          severity: P1
          component: alb
        annotations:
          summary: "ALB target 5xx rate is >= 5%"
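
The `for:` field in the rules above delays firing until the expression has held continuously for that long. The following Python sketch is a simplified model of that behaviour, not vmalert's actual code: `alert_states` is a hypothetical helper, and it assumes evaluations happen at a fixed interval.

```python
def alert_states(condition_results, interval_s, for_s):
    """Return 'inactive', 'pending', or 'firing' for each evaluation.

    condition_results: one boolean per evaluation (True = expression matched).
    interval_s: assumed fixed evaluation interval in seconds.
    for_s: the rule's `for` duration in seconds.
    """
    pending_since = None
    states = []
    for i, matched in enumerate(condition_results):
        now = i * interval_s
        if not matched:
            # Any non-matching evaluation resets the hold timer.
            pending_since = None
            states.append("inactive")
            continue
        if pending_since is None:
            pending_since = now
        states.append("firing" if now - pending_since >= for_s else "pending")
    return states


# Hypothetical run: 1-minute evaluations against a `for: 2m` rule.
print(alert_states([True, True, True, False, True], 60, 120))
# ['pending', 'pending', 'firing', 'inactive', 'pending']
```

In this model a single healthy evaluation resets the timer, which is why a short `for` such as 2m suits the P0 rule while noisier rate-based rules use 5m.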