ALB Monitoring

1. Important Points

ALB alarms should distinguish three classes of problems:

availability:
    no healthy target
    high target 5xx rate
    ALB 5xx
    rejected connections

performance:
    high target response time
    high target connection errors
    unusual request drop

security / traffic:
    WAF blocked request spike
    fixed-response 4xx spike
    HTTPCode_ELB_4XX_Count spike
    suspicious source IP from access logs

Alarm principles:

use rate for errors:
    target 5xx count alone is noisy
    target 5xx / request count is better

use absolute count for infrastructure failure:
    HealthyHostCount == 0
    RejectedConnectionCount > 0

use p95/p99 for latency:
    average hides tail latency

low traffic service:
    combine rate and minimum request count
    otherwise 1 failed request can produce 100% error rate
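
To make the latency principle concrete, the sketch below compares the average with p95 on made-up response times. The `percentile` helper and the sample data are assumptions for illustration only; it uses the simple nearest-rank method, and CloudWatch may compute percentiles differently.

```python
def percentile(samples, pct):
    """Nearest-rank percentile: smallest value with at least pct% of samples at or below it."""
    ordered = sorted(samples)
    # Integer ceiling of pct * n / 100 avoids floating-point rounding surprises.
    rank = (pct * len(ordered) + 99) // 100
    return ordered[max(rank, 1) - 1]


# Hypothetical response times in seconds: 90 fast requests, 10 slow ones.
latencies = [0.1] * 90 + [4.0] * 10

mean = sum(latencies) / len(latencies)
p95 = percentile(latencies, 95)

print(f"mean={mean:.2f}s p95={p95:.1f}s")
```

On this data the mean stays below a 1s threshold while p95 sits at 4s, so an alarm on the average would stay silent even though one request in ten is slow.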

2. Must-have Alarms

Severity	Alarm	Meaning	Why Monitor	Metric / Expression	Threshold	Duration
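P0	No healthy target	Target group has no healthy targets	ALB cannot forward traffic to any available backend	`HealthyHostCount Minimum`	`< 1`	2m
P1	Target 5xx rate high	Share of target 5xx responses is too high	Catches user-facing request failures caused by the application, a dependency, or a deployment	`100 * Target5xx / RequestCount`	`>= 5%`	5m
P1	ALB 5xx exists	ALB itself generates 5xx responses	Catches no healthy target, connection failures, or protocol problems	`HTTPCode_ELB_5XX_Count Sum`	`>= 10`	5m
P1	Target p95 latency high	Target response p95 latency is high	Catches backend performance regressions and user-experience risk	`TargetResponseTime p95`	`>= 1s`	10m
P2	Rejected connections	ALB rejects connections	Catches connection/LCU pressure or traffic bursts that exceed limits	`RejectedConnectionCount Sum`	`> 0`	5m
P2	Target connection errors	ALB fails to connect to targets	Catches security group, port, crashed-target, or network path problems	`TargetConnectionErrorCount Sum`	`> 0`	5m
P2	Unhealthy target exists	At least one target is unhealthy	Catches partial instance failure early, before capacity drops further	`UnHealthyHostCount Maximum`	`> 0`	5m
P3	Traffic abnormal drop	Request volume drops abnormally against the historical baseline	Catches DNS, routing, certificate, deployment, or upstream traffic anomalies	anomaly detection / baseline	env-specific	15m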
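
The "Traffic abnormal drop" row leaves the baseline env-specific. As a rough illustration of what a baseline comparison means, the sketch below flags a drop against the median of earlier counts. This is not how CloudWatch anomaly detection works internally; the function name, the 50% ratio, the minimum-baseline guard, and the sample counts are all hypothetical.

```python
from statistics import median


def is_abnormal_drop(current, baseline_counts, max_drop_ratio=0.5, min_baseline=100):
    """True when current traffic falls below a fraction of the historical median."""
    if not baseline_counts:
        return False  # no history, nothing to compare against
    baseline = median(baseline_counts)
    if baseline < min_baseline:
        return False  # too little normal traffic for a meaningful comparison
    return current < baseline * max_drop_ratio


# Hypothetical 15-minute request counts for the same time slot on previous days.
history = [12000, 11800, 12500, 11900, 12200]

print(is_abnormal_drop(11000, history))  # normal variation
print(is_abnormal_drop(3000, history))   # well under half the baseline
```

The minimum-baseline guard plays the same role as the minimum request count for error rates: a quiet service should not page because it went from 4 requests to 1.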

Recommended dimensions:

LoadBalancer:
    app/prod-public-alb/abc123

TargetGroup:
    targetgroup/prod-order-api/def456

AvailabilityZone:
    use only when debugging AZ-specific issue
    most service alarms should not include AZ dimension
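
The dimension values above are the tail of the resource ARNs, which is easy to get wrong when copying by hand. A minimal sketch of the mapping, assuming standard ELBv2 ARN layout; the helper name is hypothetical and the load balancer ARN in the example is constructed from this document's sample names.

```python
def cloudwatch_dimension(arn):
    """Return the CloudWatch dimension value for an ALB or target group ARN."""
    parts = arn.split(":", 5)
    if len(parts) < 6:
        raise ValueError(f"not an ARN: {arn}")
    resource = parts[5]  # everything after the fifth colon
    if resource.startswith("loadbalancer/"):
        # The LoadBalancer dimension drops the "loadbalancer/" prefix.
        return resource[len("loadbalancer/"):]
    if resource.startswith("targetgroup/"):
        # The TargetGroup dimension keeps the "targetgroup/" prefix.
        return resource
    raise ValueError(f"not a load balancer or target group ARN: {arn}")


lb_arn = "arn:aws:elasticloadbalancing:ap-east-1:111122223333:loadbalancer/app/prod-public-alb/abc123"
tg_arn = "arn:aws:elasticloadbalancing:ap-east-1:111122223333:targetgroup/prod-order-api/def456"

print(cloudwatch_dimension(lb_arn))
print(cloudwatch_dimension(tg_arn))
```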

3. SNS Topics

Create one SNS topic per alert route:

aws sns create-topic \
  --name prod-critical-alerts \
  --region ap-east-1

aws sns subscribe \
  --topic-arn arn:aws:sns:ap-east-1:111122223333:prod-critical-alerts \
  --protocol email \
  --notification-endpoint sre@example.com \
  --region ap-east-1

For ChatOps, route SNS to:

common options:
    AWS Chatbot to Slack
    Lambda webhook relay
    PagerDuty / Opsgenie integration
    EventBridge rule to incident workflow

4. No Healthy Target Alarm

This is the highest-value ALB alarm. If no healthy target exists, the ALB cannot send traffic to the service.

aws cloudwatch put-metric-alarm \
  --alarm-name "P0-ALB-prod-order-api-no-healthy-target" \
  --alarm-description "ALB target group has no healthy target" \
  --namespace AWS/ApplicationELB \
  --metric-name HealthyHostCount \
  --dimensions \
      Name=TargetGroup,Value=targetgroup/prod-order-api/def456 \
      Name=LoadBalancer,Value=app/prod-public-alb/abc123 \
  --statistic Minimum \
  --period 60 \
  --evaluation-periods 2 \
  --datapoints-to-alarm 2 \
  --threshold 1 \
  --comparison-operator LessThanThreshold \
  --treat-missing-data breaching \
  --alarm-actions arn:aws:sns:ap-east-1:111122223333:prod-critical-alerts \
  --ok-actions arn:aws:sns:ap-east-1:111122223333:prod-critical-alerts \
  --region ap-east-1

Verify:

aws cloudwatch describe-alarms \
  --alarm-names "P0-ALB-prod-order-api-no-healthy-target" \
  --region ap-east-1

aws elbv2 describe-target-health \
  --target-group-arn arn:aws:elasticloadbalancing:ap-east-1:111122223333:targetgroup/prod-order-api/def456 \
  --region ap-east-1

5. Target 5xx Rate Alarm

Target 5xx means the backend itself returned a 5xx response. This usually points to an application, dependency, deployment, or capacity issue.

aws cloudwatch put-metric-alarm \
  --alarm-name "P1-ALB-prod-order-api-target-5xx-rate" \
  --alarm-description "Target 5xx rate is >= 5% for 5 minutes" \
  --evaluation-periods 1 \
  --datapoints-to-alarm 1 \
  --threshold 5 \
  --comparison-operator GreaterThanOrEqualToThreshold \
  --treat-missing-data notBreaching \
  --metrics '[
    {
      "Id": "req",
      "MetricStat": {
        "Metric": {
          "Namespace": "AWS/ApplicationELB",
          "MetricName": "RequestCount",
          "Dimensions": [
            {"Name": "LoadBalancer", "Value": "app/prod-public-alb/abc123"},
            {"Name": "TargetGroup", "Value": "targetgroup/prod-order-api/def456"}
          ]
        },
        "Period": 300,
        "Stat": "Sum"
      },
      "ReturnData": false
    },
    {
      "Id": "e5xx",
      "MetricStat": {
        "Metric": {
          "Namespace": "AWS/ApplicationELB",
          "MetricName": "HTTPCode_Target_5XX_Count",
          "Dimensions": [
            {"Name": "LoadBalancer", "Value": "app/prod-public-alb/abc123"},
            {"Name": "TargetGroup", "Value": "targetgroup/prod-order-api/def456"}
          ]
        },
        "Period": 300,
        "Stat": "Sum"
      },
      "ReturnData": false
    },
    {
      "Id": "target_5xx_rate",
      "Expression": "IF(req>=100,100*e5xx/req,0)",
      "Label": "Target 5xx percent",
      "ReturnData": true
    }
  ]' \
  --alarm-actions arn:aws:sns:ap-east-1:111122223333:prod-critical-alerts \
  --ok-actions arn:aws:sns:ap-east-1:111122223333:prod-critical-alerts \
  --region ap-east-1

Why req>=100:

low traffic service:
    1 request and 1 failure = 100%
    this is often not P1

high enough sample:
    only alert when enough requests happened in the period
    tune 100 based on service traffic
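
The gating logic in the metric math expression can be checked offline before touching the alarm. The sketch below mirrors `IF(req>=100,100*e5xx/req,0)` in plain Python; the function names are hypothetical and the request counts are made up.

```python
def target_5xx_percent(requests, errors, min_requests=100):
    """Mirror of IF(req>=100, 100*e5xx/req, 0): report 0 when the sample is too small."""
    if requests < min_requests:
        return 0.0
    return 100.0 * errors / requests


def breaches(requests, errors, threshold_percent=5.0):
    """True when the gated rate meets the >= 5% alarm threshold."""
    return target_5xx_percent(requests, errors) >= threshold_percent


print(breaches(1, 1))       # 1 of 1 failed: gated to 0%, no alarm
print(breaches(2000, 40))   # 2% of 2000 requests: below threshold
print(breaches(2000, 100))  # 5% of 2000 requests: alarm
```

One consequence of gating to 0 is that a total outage on a very quiet service never trips this alarm, which is one reason to keep the absolute-count alarms alongside it.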

6. ALB 5xx Alarm

ALB 5xx means the load balancer itself could not successfully handle or forward the request. Common causes are no healthy targets, target connection problems, TLS/HTTP protocol problems, or an internal ALB error.

aws cloudwatch put-metric-alarm \
  --alarm-name "P1-ALB-prod-public-alb-elb-5xx" \
  --alarm-description "ALB generated 5xx responses" \
  --namespace AWS/ApplicationELB \
  --metric-name HTTPCode_ELB_5XX_Count \
  --dimensions Name=LoadBalancer,Value=app/prod-public-alb/abc123 \
  --statistic Sum \
  --period 300 \
  --evaluation-periods 1 \
  --datapoints-to-alarm 1 \
  --threshold 10 \
  --comparison-operator GreaterThanOrEqualToThreshold \
  --treat-missing-data notBreaching \
  --alarm-actions arn:aws:sns:ap-east-1:111122223333:prod-critical-alerts \
  --region ap-east-1

7. Latency Alarm

Use TargetResponseTime p95 for backend response latency. This is not full browser latency: it measures the time, in seconds, from when the request leaves the ALB until the target starts sending the response.

aws cloudwatch put-metric-alarm \
  --alarm-name "P1-ALB-prod-order-api-target-p95-latency" \
  --alarm-description "Target response time p95 is high" \
  --namespace AWS/ApplicationELB \
  --metric-name TargetResponseTime \
  --dimensions \
      Name=TargetGroup,Value=targetgroup/prod-order-api/def456 \
      Name=LoadBalancer,Value=app/prod-public-alb/abc123 \
  --extended-statistic p95 \
  --period 60 \
  --evaluation-periods 10 \
  --datapoints-to-alarm 8 \
  --threshold 1 \
  --comparison-operator GreaterThanOrEqualToThreshold \
  --treat-missing-data notBreaching \
  --alarm-actions arn:aws:sns:ap-east-1:111122223333:prod-critical-alerts \
  --region ap-east-1
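
The alarm above fires when 8 of the last 10 one-minute datapoints breach. The sketch below is a simplified model of that M-of-N rule, useful for reasoning about which latency patterns would page. It is an approximation: CloudWatch's real handling of missing data is more involved, and the function name and sample series are hypothetical.

```python
def is_alarming(datapoints, threshold=1.0, evaluation_periods=10, datapoints_to_alarm=8):
    """Simplified M-of-N check over the most recent evaluation window.

    datapoints: p95 values in seconds, oldest first; None marks a missing
    period, counted as not breaching (like --treat-missing-data notBreaching).
    """
    window = datapoints[-evaluation_periods:]
    breaching = sum(1 for value in window if value is not None and value >= threshold)
    return breaching >= datapoints_to_alarm


steady_slow = [1.4] * 10                  # 10 of 10 breaching
brief_spike = [0.3] * 7 + [2.0] * 3       # 3 of 10 breaching
slow_with_gaps = [1.2] * 8 + [None, 0.4]  # 8 of 10 breaching

print(is_alarming(steady_slow), is_alarming(brief_spike), is_alarming(slow_with_gaps))
```

A three-minute spike does not page, while sustained slowness does even with a couple of good or missing minutes in the window.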

Tune by service:

public API:
    p95 >= 1s for 10m often P1/P2

internal API:
    p95 threshold should follow SLO

worker callback / webhook:
    threshold may be higher, but should still be explicit

8. WAF Blocked Request Alarm

If WAF is attached to the ALB, alert on spikes in blocked requests. WAF metrics are in the AWS/WAFV2 namespace.

aws cloudwatch put-metric-alarm \
  --alarm-name "P2-WAF-prod-public-alb-blocked-spike" \
  --alarm-description "WAF blocked requests increased" \
  --namespace AWS/WAFV2 \
  --metric-name BlockedRequests \
  --dimensions \
      Name=WebACL,Value=prod-public-alb-web-acl \
      Name=Rule,Value=ALL \
      Name=Region,Value=ap-east-1 \
  --statistic Sum \
  --period 300 \
  --evaluation-periods 2 \
  --datapoints-to-alarm 2 \
  --threshold 1000 \
  --comparison-operator GreaterThanOrEqualToThreshold \
  --treat-missing-data notBreaching \
  --alarm-actions arn:aws:sns:ap-east-1:111122223333:prod-security-alerts \
  --region ap-east-1

Interpretation:

blocked spike can mean:
    real attack
    crawler / scanner
    false positive after rule change
    broken client behavior

next check:
    WAF sampled requests
    WAF logs
    ALB access logs by source IP / path / user-agent

9. EventBridge For Alarm State

Route CloudWatch alarm state changes to automation:

{
  "source": ["aws.cloudwatch"],
  "detail-type": ["CloudWatch Alarm State Change"],
  "detail": {
    "state": {
      "value": ["ALARM"]
    },
    "alarmName": [{
      "prefix": "P"
    }]
  }
}
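
Before deploying the pattern, it can help to check which alarms it would actually select. The sketch below hand-codes this one pattern in Python; it is not a general implementation of EventBridge matching, the sample events are trimmed to the fields the pattern reads, and `Payments-cpu-high` is a hypothetical alarm name.

```python
def matches_alarm_pattern(event):
    """Check an event against the pattern above: ALARM state and alarm name starting with "P"."""
    detail = event.get("detail", {})
    return (
        event.get("source") == "aws.cloudwatch"
        and event.get("detail-type") == "CloudWatch Alarm State Change"
        and detail.get("state", {}).get("value") == "ALARM"
        and str(detail.get("alarmName", "")).startswith("P")
    )


def sample_event(alarm_name, state):
    """Build a trimmed, hypothetical alarm state change event."""
    return {
        "source": "aws.cloudwatch",
        "detail-type": "CloudWatch Alarm State Change",
        "detail": {"alarmName": alarm_name, "state": {"value": state}},
    }


print(matches_alarm_pattern(sample_event("P0-ALB-prod-order-api-no-healthy-target", "ALARM")))
print(matches_alarm_pattern(sample_event("P0-ALB-prod-order-api-no-healthy-target", "OK")))
print(matches_alarm_pattern(sample_event("Payments-cpu-high", "ALARM")))
```

The last case matches too: a bare "P" prefix selects any alarm whose name starts with that letter, not only the severity-prefixed ones, so a narrower prefix may be worth considering if other naming schemes share the account.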

Create the rule (targets are attached afterwards with aws events put-targets):

aws events put-rule \
  --name cloudwatch-alarm-to-incident \
  --event-pattern file://cloudwatch-alarm-event-pattern.json \
  --region ap-east-1

Common targets:

target:
    SNS topic
    Lambda incident router
    Step Functions runbook
    SSM Automation

10. Access Log Investigation

Access logs are delivered to S3 once access logging is enabled on the load balancer (it is off by default). Use Athena for incident queries.

Typical questions:

what changed:
    5xx by target
    4xx by path
    top source IP
    top user-agent
    WAF blocked IP and path correlation
    latency by target group

Example Athena query after creating the ALB access log table:

-- Assumes the AWS-documented table DDL, where elb_status_code is an int
-- and target_status_code is a string; adjust if your table differs.
SELECT
  request_url,
  target_status_code,
  elb_status_code,
  COUNT(*) AS requests
FROM alb_access_logs
WHERE day = '2026/06/03'
  AND elb = 'app/prod-public-alb/abc123'
  AND (elb_status_code BETWEEN 500 AND 599 OR target_status_code LIKE '5%')
GROUP BY request_url, target_status_code, elb_status_code
ORDER BY requests DESC
LIMIT 50;

Top source IP:

-- Assumes elb_status_code is an int, as in the AWS-documented table DDL.
SELECT
  client_ip,
  COUNT(*) AS requests,
  SUM(CASE WHEN elb_status_code BETWEEN 400 AND 499 THEN 1 ELSE 0 END) AS elb_4xx,
  SUM(CASE WHEN elb_status_code BETWEEN 500 AND 599 THEN 1 ELSE 0 END) AS elb_5xx
FROM alb_access_logs
WHERE day = '2026/06/03'
GROUP BY client_ip
ORDER BY requests DESC
LIMIT 50;

11. Dashboard

Dashboard widgets:

traffic:
    RequestCount
    ActiveConnectionCount
    NewConnectionCount

errors:
    HTTPCode_ELB_4XX_Count
    HTTPCode_ELB_5XX_Count
    HTTPCode_Target_4XX_Count
    HTTPCode_Target_5XX_Count
    target 5xx rate metric math

latency:
    TargetResponseTime p50 / p95 / p99

health:
    HealthyHostCount
    UnHealthyHostCount
    TargetConnectionErrorCount

security:
    WAF AllowedRequests
    WAF BlockedRequests
    WAF CountedRequests

12. Incident Checklist

no healthy target:
    describe-target-health
    check ECS/EC2 deployment
    check app readiness endpoint
    check target SG allows ALB SG
    check container/instance logs

target 5xx:
    compare deployment time
    inspect app logs by trace/request id
    check dependency errors
    check target CPU/memory/database

ALB 5xx:
    check target connection errors
    check healthy host count
    check TLS/protocol mismatch
    inspect ALB access logs

latency:
    check target p95/p99
    check app dependency latency
    check target saturation
    check database and downstream services

WAF blocked spike:
    inspect WAF logs and sampled requests
    compare recent WAF rule change
    check source IP / country / path
    decide block, allowlist, or rule tuning