Links#
- CloudWatch metrics for Application Load Balancers
- Create a CloudWatch alarm based on a static threshold
- Using metric math
- aws cloudwatch put-metric-alarm
- Access logs for your Application Load Balancer
1. Important Points#
ALB alarms should distinguish three classes of problems:
availability:
  no healthy target
  high target 5xx rate
  ALB 5xx
  rejected connections
performance:
  high target response time
  high target connection errors
  unusual request drop
security / traffic:
  WAF blocked request spike
  fixed-response 4xx spike
  HTTPCode_ELB_4XX_Count spike
  suspicious source IP from access logs
Alerting principles:
use rate for errors:
  target 5xx count alone is noisy
  target 5xx / request count is better
use absolute count for infrastructure failure:
  HealthyHostCount == 0
  RejectedConnectionCount > 0
use p95/p99 for latency:
  average hides tail latency
low traffic service:
  combine rate and minimum request count
  otherwise 1 failed request can produce 100% error rate
2. Must-have Alarms#
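| Severity | Alarm | Meaning | Why Monitor | Metric / Expression | Threshold | Duration |
|---|---|---|---|---|---|---|
| P0 | No healthy target | Target group has no healthy target | ALB cannot forward traffic to any available backend | HealthyHostCount Minimum | < 1 | 2m |
| P1 | Target 5xx rate high | Share of target responses that are 5xx is too high | Detects user request failures caused by the application, a dependency, or a deployment | 100 * HTTPCode_Target_5XX_Count / RequestCount | >= 5% | 5m |
| P1 | ALB 5xx exists | ALB itself generates 5xx responses | Detects no healthy target, connection failures, or protocol problems | HTTPCode_ELB_5XX_Count Sum | >= 10 | 5m |
| P1 | Target p95 latency high | Target response p95 latency is high | Detects backend performance regression and user experience risk | TargetResponseTime p95 | >= 1s | 10m |
| P2 | Rejected connections | ALB rejects connections | Detects connection/LCU pressure or traffic bursts that exceed limits | RejectedConnectionCount Sum | > 0 | 5m |
| P2 | Target connection errors | ALB fails to connect to targets | Detects security group, port, crashed target, or network path problems | TargetConnectionErrorCount Sum | > 0 | 5m |
| P2 | Unhealthy target exists | At least one target is unhealthy | Catches partial instance failures early, before capacity drops further | UnHealthyHostCount Maximum | > 0 | 5m |
| P3 | Traffic abnormal drop | Request volume drops abnormally relative to the historical baseline | Detects DNS, routing, certificate, deployment, or upstream traffic anomalies | anomaly detection / baseline | env-specific | 15m |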
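The three count-based P2 rows (Rejected connections, Target connection errors, Unhealthy target exists) have no dedicated walkthrough in this document. A minimal sketch of a shell helper that could create them, assuming the same example load balancer, target group, Region, and SNS topic used elsewhere here. The function name `put_alb_count_alarm` and the alarm names in the commented calls are hypothetical. RejectedConnectionCount is passed with only the LoadBalancer dimension because it is a load balancer level metric.

```shell
#!/usr/bin/env bash
# Sketch only: put_alb_count_alarm is a hypothetical helper, not an AWS command.
# Region, topic ARN, and resource IDs are the example values used in this document.
REGION="ap-east-1"
TOPIC_ARN="arn:aws:sns:ap-east-1:111122223333:prod-critical-alerts"
ALB="app/prod-public-alb/abc123"
TG="targetgroup/prod-order-api/def456"

# Usage: put_alb_count_alarm <alarm-name> <metric> <statistic> <dimension>...
# Alarms when the statistic is > 0 for one 5-minute period, matching the table rows.
put_alb_count_alarm() {
  local name="$1" metric="$2" stat="$3"
  shift 3
  aws cloudwatch put-metric-alarm \
    --alarm-name "$name" \
    --namespace AWS/ApplicationELB \
    --metric-name "$metric" \
    --dimensions "$@" \
    --statistic "$stat" \
    --period 300 \
    --evaluation-periods 1 \
    --datapoints-to-alarm 1 \
    --threshold 0 \
    --comparison-operator GreaterThanThreshold \
    --treat-missing-data notBreaching \
    --alarm-actions "$TOPIC_ARN" \
    --region "$REGION"
}

# Example calls, commented out so that sourcing this file creates nothing:
# put_alb_count_alarm "P2-ALB-prod-public-alb-rejected-connections" \
#   RejectedConnectionCount Sum "Name=LoadBalancer,Value=$ALB"
# put_alb_count_alarm "P2-ALB-prod-order-api-target-connection-errors" \
#   TargetConnectionErrorCount Sum "Name=TargetGroup,Value=$TG" "Name=LoadBalancer,Value=$ALB"
# put_alb_count_alarm "P2-ALB-prod-order-api-unhealthy-target" \
#   UnHealthyHostCount Maximum "Name=TargetGroup,Value=$TG" "Name=LoadBalancer,Value=$ALB"
```

Keeping the threshold, period, and missing-data handling in one function makes it harder for the P2 alarms to drift apart; whether `notBreaching` is the right choice for each metric is a judgment call to review per service.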
Recommended dimensions:
LoadBalancer:
  app/prod-public-alb/abc123
TargetGroup:
  targetgroup/prod-order-api/def456
AvailabilityZone:
  use only when debugging AZ-specific issue
  most service alarms should not include AZ dimension
3. SNS Topic#
Create one topic per alert route:
aws sns create-topic \
--name prod-critical-alerts \
--region ap-east-1
aws sns subscribe \
--topic-arn arn:aws:sns:ap-east-1:111122223333:prod-critical-alerts \
--protocol email \
--notification-endpoint sre@example.com \
--region ap-east-1
For ChatOps, route SNS to:
common options:
  AWS Chatbot to Slack
  Lambda webhook relay
  PagerDuty / Opsgenie integration
  EventBridge rule to incident workflow
4. No Healthy Target Alarm#
This is the highest-value ALB alarm. If no healthy target exists, the ALB cannot send traffic to the service.
aws cloudwatch put-metric-alarm \
--alarm-name "P0-ALB-prod-order-api-no-healthy-target" \
--alarm-description "ALB target group has no healthy target" \
--namespace AWS/ApplicationELB \
--metric-name HealthyHostCount \
--dimensions \
Name=TargetGroup,Value=targetgroup/prod-order-api/def456 \
Name=LoadBalancer,Value=app/prod-public-alb/abc123 \
--statistic Minimum \
--period 60 \
--evaluation-periods 2 \
--datapoints-to-alarm 2 \
--threshold 1 \
--comparison-operator LessThanThreshold \
--treat-missing-data breaching \
--alarm-actions arn:aws:sns:ap-east-1:111122223333:prod-critical-alerts \
--ok-actions arn:aws:sns:ap-east-1:111122223333:prod-critical-alerts \
--region ap-east-1
Verify:
aws cloudwatch describe-alarms \
--alarm-names "P0-ALB-prod-order-api-no-healthy-target" \
--region ap-east-1
aws elbv2 describe-target-health \
--target-group-arn arn:aws:elasticloadbalancing:ap-east-1:111122223333:targetgroup/prod-order-api/def456 \
--region ap-east-1
5. Target 5xx Rate Alarm#
Target 5xx means the backend returned a 5xx response. This usually points to an application, dependency, deployment, or capacity issue.
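The alarm in this section is driven by a guarded percentage rather than a raw count: the rate is only reported once enough requests arrived in the period. A minimal sketch of that arithmetic in plain shell, assuming integer request and error counts for a single period. The function name `guarded_5xx_rate` and its optional third argument are hypothetical, and CloudWatch evaluates the real expression in floating point rather than with integer division.

```shell
# guarded_5xx_rate is a hypothetical helper that mirrors
# IF(req >= MIN, 100 * e5xx / req, 0) with integer arithmetic.
# Usage: guarded_5xx_rate <request-count> <5xx-count> [min-requests, default 100]
guarded_5xx_rate() {
  local req="$1" e5xx="$2" min_requests="${3:-100}"
  if [ "$req" -ge "$min_requests" ]; then
    echo $(( 100 * e5xx / req ))
  else
    # Too few requests in the period: report 0 so the alarm stays quiet.
    echo 0
  fi
}

guarded_5xx_rate 1 1        # prints 0: one failed request out of one is suppressed
guarded_5xx_rate 2000 120   # prints 6: enough traffic, and 6% is above the 5% threshold
```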
aws cloudwatch put-metric-alarm \
--alarm-name "P1-ALB-prod-order-api-target-5xx-rate" \
--alarm-description "Target 5xx rate is >= 5% for 5 minutes" \
--evaluation-periods 1 \
--datapoints-to-alarm 1 \
--threshold 5 \
--comparison-operator GreaterThanOrEqualToThreshold \
--treat-missing-data notBreaching \
--metrics '[
{
"Id": "req",
"MetricStat": {
"Metric": {
"Namespace": "AWS/ApplicationELB",
"MetricName": "RequestCount",
"Dimensions": [
{"Name": "LoadBalancer", "Value": "app/prod-public-alb/abc123"},
{"Name": "TargetGroup", "Value": "targetgroup/prod-order-api/def456"}
]
},
"Period": 300,
"Stat": "Sum"
},
"ReturnData": false
},
{
"Id": "e5xx",
"MetricStat": {
"Metric": {
"Namespace": "AWS/ApplicationELB",
"MetricName": "HTTPCode_Target_5XX_Count",
"Dimensions": [
{"Name": "LoadBalancer", "Value": "app/prod-public-alb/abc123"},
{"Name": "TargetGroup", "Value": "targetgroup/prod-order-api/def456"}
]
},
"Period": 300,
"Stat": "Sum"
},
"ReturnData": false
},
{
"Id": "target_5xx_rate",
"Expression": "IF(req>=100,100*e5xx/req,0)",
"Label": "Target 5xx percent",
"ReturnData": true
}
]' \
--alarm-actions arn:aws:sns:ap-east-1:111122223333:prod-critical-alerts \
--ok-actions arn:aws:sns:ap-east-1:111122223333:prod-critical-alerts \
--region ap-east-1
Why req>=100:
low traffic service:
  1 request and 1 failure = 100%
  this is often not P1
high enough sample:
  only alert when enough requests happened in the period
  tune 100 based on service traffic
6. ALB 5xx Alarm#
ALB 5xx means the load balancer itself could not successfully handle or forward the request. Common causes are no healthy target, a target connection problem, a TLS/HTTP protocol problem, or an internal ALB error.
aws cloudwatch put-metric-alarm \
--alarm-name "P1-ALB-prod-public-alb-elb-5xx" \
--alarm-description "ALB generated 5xx responses" \
--namespace AWS/ApplicationELB \
--metric-name HTTPCode_ELB_5XX_Count \
--dimensions Name=LoadBalancer,Value=app/prod-public-alb/abc123 \
--statistic Sum \
--period 300 \
--evaluation-periods 1 \
--datapoints-to-alarm 1 \
--threshold 10 \
--comparison-operator GreaterThanOrEqualToThreshold \
--treat-missing-data notBreaching \
--alarm-actions arn:aws:sns:ap-east-1:111122223333:prod-critical-alerts \
--region ap-east-1
7. Latency Alarm#
Use TargetResponseTime p95 for backend response latency. This is not full browser latency; it measures the time from when the ALB sends the request to the target until the target starts sending the response.
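The latency alarm in this section uses M-out-of-N evaluation: 8 breaching datapoints within 10 one-minute periods. A rough sketch of that counting rule in shell, assuming every datapoint is present. The function name `m_of_n_breach` is hypothetical, and real CloudWatch evaluation also applies the missing-data setting, which this sketch ignores.

```shell
# m_of_n_breach is a hypothetical helper, not part of the AWS CLI.
# Usage: m_of_n_breach <threshold> <m> <datapoint>...
# Prints ALARM when at least <m> datapoints are >= <threshold>, otherwise OK.
m_of_n_breach() {
  local threshold="$1" m="$2"
  shift 2
  printf '%s\n' "$@" | awk -v t="$threshold" -v m="$m" '
    $1 + 0 >= t + 0 { breaching++ }
    END { print ((breaching >= m) ? "ALARM" : "OK") }'
}

# Ten one-minute p95 values in seconds; eight are at or above the 1s threshold.
m_of_n_breach 1 8 1.2 1.4 0.6 1.1 1.3 1.5 0.9 1.2 1.1 1.0   # prints ALARM
```

Requiring 8 of 10 rather than 10 of 10 lets the alarm fire even when one or two minutes dip back under the threshold during an otherwise sustained regression.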
aws cloudwatch put-metric-alarm \
--alarm-name "P1-ALB-prod-order-api-target-p95-latency" \
--alarm-description "Target response time p95 is high" \
--namespace AWS/ApplicationELB \
--metric-name TargetResponseTime \
--dimensions \
Name=TargetGroup,Value=targetgroup/prod-order-api/def456 \
Name=LoadBalancer,Value=app/prod-public-alb/abc123 \
--extended-statistic p95 \
--period 60 \
--evaluation-periods 10 \
--datapoints-to-alarm 8 \
--threshold 1 \
--comparison-operator GreaterThanOrEqualToThreshold \
--treat-missing-data notBreaching \
--alarm-actions arn:aws:sns:ap-east-1:111122223333:prod-critical-alerts \
--region ap-east-1
Tune by service:
public API:
  p95 >= 1s for 10m often P1/P2
internal API:
  p95 threshold should follow SLO
worker callback / webhook:
  threshold may be higher, but should still be explicit
8. WAF Blocked Request Alarm#
If WAF is attached to the ALB, alert on a spike in blocked requests. WAF metrics are in the AWS/WAFV2 namespace.
aws cloudwatch put-metric-alarm \
--alarm-name "P2-WAF-prod-public-alb-blocked-spike" \
--alarm-description "WAF blocked requests increased" \
--namespace AWS/WAFV2 \
--metric-name BlockedRequests \
--dimensions \
Name=WebACL,Value=prod-public-alb-web-acl \
Name=Rule,Value=ALL \
Name=Region,Value=ap-east-1 \
--statistic Sum \
--period 300 \
--evaluation-periods 2 \
--datapoints-to-alarm 2 \
--threshold 1000 \
--comparison-operator GreaterThanOrEqualToThreshold \
--treat-missing-data notBreaching \
--alarm-actions arn:aws:sns:ap-east-1:111122223333:prod-security-alerts \
--region ap-east-1
Interpretation:
blocked spike can mean:
  real attack
  crawler / scanner
  false positive after rule change
  broken client behavior
next check:
  WAF sampled requests
  WAF logs
  ALB access logs by source IP / path / user-agent
9. EventBridge For Alarm State#
Route CloudWatch alarm state changes to automation:
{
"source": ["aws.cloudwatch"],
"detail-type": ["CloudWatch Alarm State Change"],
"detail": {
"state": {
"value": ["ALARM"]
},
"alarmName": [{
"prefix": "P"
}]
}
}
Create rule (the rule delivers nothing until targets are attached with aws events put-targets):
aws events put-rule \
--name cloudwatch-alarm-to-incident \
--event-pattern file://cloudwatch-alarm-event-pattern.json \
--region ap-east-1
Common targets:
target:
  SNS topic
  Lambda incident router
  Step Functions runbook
  SSM Automation
10. Access Log Investigation#
Access logs are delivered to S3. Use Athena for incident queries.
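Access logs are disabled by default, so the queries in this section only work once logging has been turned on for the load balancer. A minimal sketch of enabling it from the CLI, assuming the S3 bucket already exists in the same Region with a bucket policy that allows Elastic Load Balancing to write to it. The function name `enable_alb_access_logs` is hypothetical, and the load balancer ARN, bucket, and prefix in the commented call are placeholders.

```shell
# enable_alb_access_logs is a hypothetical helper, not an AWS command.
# Usage: enable_alb_access_logs <load-balancer-arn> <bucket> <prefix> [region]
enable_alb_access_logs() {
  local lb_arn="$1" bucket="$2" prefix="$3" region="${4:-ap-east-1}"
  aws elbv2 modify-load-balancer-attributes \
    --load-balancer-arn "$lb_arn" \
    --attributes \
      "Key=access_logs.s3.enabled,Value=true" \
      "Key=access_logs.s3.bucket,Value=$bucket" \
      "Key=access_logs.s3.prefix,Value=$prefix" \
    --region "$region"
}

# Example call with placeholder values, commented out so sourcing changes nothing:
# enable_alb_access_logs \
#   "arn:aws:elasticloadbalancing:ap-east-1:111122223333:loadbalancer/app/prod-public-alb/abc123" \
#   "example-alb-logs-bucket" "prod-public-alb"
```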
Typical questions:
what changed:
  5xx by target
  4xx by path
  top source IP
  top user-agent
  WAF blocked IP and path correlation
  latency by target group
Example Athena query after creating the ALB access log table:
SELECT
request_url,
target_status_code,
elb_status_code,
COUNT(*) AS requests
FROM alb_access_logs
WHERE day = '2026/06/03'
AND elb = 'app/prod-public-alb/abc123'
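-- CAST keeps LIKE valid when elb_status_code is an int column, as in the table DDL from the AWS documentation
AND (CAST(elb_status_code AS varchar) LIKE '5%' OR target_status_code LIKE '5%')
GROUP BY request_url, target_status_code, elb_status_code
ORDER BY requests DESC
LIMIT 50;
Top source IP: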
SELECT
client_ip,
COUNT(*) AS requests,
SUM(CASE WHEN CAST(elb_status_code AS varchar) LIKE '4%' THEN 1 ELSE 0 END) AS elb_4xx,
SUM(CASE WHEN CAST(elb_status_code AS varchar) LIKE '5%' THEN 1 ELSE 0 END) AS elb_5xx
FROM alb_access_logs
WHERE day = '2026/06/03'
GROUP BY client_ip
ORDER BY requests DESC
LIMIT 50;
11. Dashboard#
Dashboard widgets:
traffic:
  RequestCount
  ActiveConnectionCount
  NewConnectionCount
errors:
  HTTPCode_ELB_4XX_Count
  HTTPCode_ELB_5XX_Count
  HTTPCode_Target_4XX_Count
  HTTPCode_Target_5XX_Count
  target 5xx rate metric math
latency:
  TargetResponseTime p50 / p95 / p99
health:
  HealthyHostCount
  UnHealthyHostCount
  TargetConnectionErrorCount
security:
  WAF AllowedRequests
  WAF BlockedRequests
  WAF CountedRequests
12. Incident Checklist#
no healthy target:
  describe-target-health
  check ECS/EC2 deployment
  check app readiness endpoint
  check target SG allows ALB SG
  check container/instance logs
target 5xx:
  compare deployment time
  inspect app logs by trace/request id
  check dependency errors
  check target CPU/memory/database
ALB 5xx:
  check target connection errors
  check healthy host count
  check TLS/protocol mismatch
  inspect ALB access logs
latency:
  check target p95/p99
  check app dependency latency
  check target saturation
  check database and downstream services
WAF blocked spike:
  inspect WAF logs and sampled requests
  compare recent WAF rule change
  check source IP / country / path
  decide block, allowlist, or rule tuning
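For the first step of the "no healthy target" checklist, the describe-target-health output can be narrowed to the targets that are not healthy, together with the reason the load balancer reports. A minimal sketch, assuming the example target group ARN used earlier in this document. The function name `list_unhealthy_targets` is hypothetical; the `--query` expression is ordinary JMESPath over the command's `TargetHealthDescriptions` output.

```shell
# list_unhealthy_targets is a hypothetical helper around describe-target-health.
# Usage: list_unhealthy_targets <target-group-arn> [region]
# Shows id, port, state, and reason for every target whose state is not "healthy".
list_unhealthy_targets() {
  local tg_arn="$1" region="${2:-ap-east-1}"
  aws elbv2 describe-target-health \
    --target-group-arn "$tg_arn" \
    --region "$region" \
    --query "TargetHealthDescriptions[?TargetHealth.State!='healthy'].[Target.Id,Target.Port,TargetHealth.State,TargetHealth.Reason]" \
    --output table
}

# Example call, commented out so that sourcing this file makes no API request:
# list_unhealthy_targets \
#   "arn:aws:elasticloadbalancing:ap-east-1:111122223333:targetgroup/prod-order-api/def456"
```

The reason column often separates failed health checks from targets that are still registering or draining, which helps decide whether to look at the deployment, the readiness endpoint, or the security groups next.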