Links#
https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/AlarmThatSendsEmail.html
https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/Create-alarm-on-metric-math-expression.html
https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/AlarmThatSendsEmail.html#alarm-evaluation
https://docs.aws.amazon.com/cli/latest/reference/cloudwatch/put-metric-alarm.html
https://docs.aws.amazon.com/sns/latest/dg/sns-lambda.html
https://docs.aws.amazon.com/lambda/latest/dg/with-sns.html

1. Important Points#
CloudWatch Alarm is the AWS-native alerting entry point. It suits AWS managed services, CloudWatch metric math, custom metrics, and basic fallback alerts.
Use CloudWatch Alarm for:
AWS service metric alert
metric math alert
anomaly detection alert
composite alarm
SNS / Lambda / incident tool integration
CloudWatch Alarm is a poor fit for:
high-cardinality label alerting
complex PromQL-style aggregation
large dynamic target discovery
long historical trend query

Recommended architecture:
CloudWatch Alarm
-> SNS Topic by env/severity
-> Lambda notification router
-> Slack / Teams / PagerDuty / Opsgenie / webhook

2. Naming Standard#
Recommended format:
<severity>-<env>-<service>-<resource_type>-<resource>-<signal>

Examples:
P0-prod-alb-targetgroup-api-HealthyHostCount
P1-prod-alb-targetgroup-api-Target5xxRate
P1-prod-ecs-service-api-TaskBelowDesired
P2-prod-ecs-service-api-CPUUtilizationHigh
P1-prod-aurora-cluster-main-CPUUtilization
P2-uat-sqs-queue-worker-OldestMessageAge

CloudWatch tags:
env=prod
service=api
team=platform
severity=P1
resource_type=targetgroup
runbook_url=https://docs.example.com/runbooks/alb-target-5xx
dashboard_url=https://console.aws.amazon.com/cloudwatch/...

3. SNS Topics#
Create topics by environment and severity class:
aws sns create-topic \
  --name prod-critical-alerts \
  --region ap-east-1
aws sns create-topic \
  --name prod-warning-alerts \
  --region ap-east-1
aws sns create-topic \
  --name uat-warning-alerts \
  --region ap-east-1

Subscribe the Lambda notification router:
aws sns subscribe \
  --topic-arn arn:aws:sns:ap-east-1:111122223333:prod-critical-alerts \
  --protocol lambda \
  --notification-endpoint arn:aws:lambda:ap-east-1:111122223333:function:cloudwatch-alarm-router \
  --region ap-east-1

Allow SNS to invoke the Lambda function:
aws lambda add-permission \
  --function-name cloudwatch-alarm-router \
  --statement-id sns-prod-critical-alerts \
  --action lambda:InvokeFunction \
  --principal sns.amazonaws.com \
  --source-arn arn:aws:sns:ap-east-1:111122223333:prod-critical-alerts \
  --region ap-east-1

4. Lambda Router Payload#
The router should normalize each CloudWatch alarm SNS message into the following fields before sending it to the final channel:
aws_alarm_name: alarm name
aws_account_id: account id
aws_region: region
aws_namespace: AWS/ApplicationELB
aws_resource: resource id / arn / dimension value
env: prod / uat / dev
service: api / worker / payment
resource_type: targetgroup / ecs-service / queue / cluster
severity: P0 / P1 / P2 / P3
source: cloudwatch
alarm_state: ALARM / OK / INSUFFICIENT_DATA
previous_state: OK / ALARM / INSUFFICIENT_DATA
reason: CloudWatch alarm reason
alarm_time: state change time
dashboard_url: dashboard link
runbook_url: runbook link

Minimal Lambda parser shape:
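export const handler = async (event) => {
  // One invocation can carry several SNS records.
  for (const record of event.Records) {
    // CloudWatch delivers the alarm as a JSON string in Sns.Message.
    const message = JSON.parse(record.Sns.Message);
    const alarmName = message.AlarmName;
    const state = message.NewStateValue;
    const previousState = message.OldStateValue;
    const normalized = {
      source: "cloudwatch",
      aws_alarm_name: alarmName,
      aws_account_id: message.AWSAccountId,
      // message.Region is a display name, not a region code. Assumption: when
      // AlarmArn is present, its fourth ARN segment holds the region code.
      aws_region: message.AlarmArn?.split(":")[3] ?? message.Region,
      alarm_state: state,
      previous_state: previousState,
      reason: message.NewStateReason,
      alarm_time: message.StateChangeTime,
      trigger: message.Trigger
    };
    // Placeholder: log the normalized payload instead of routing it.
    console.log(JSON.stringify(normalized));
  }
};

5. Basic Alarm#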
Example: ALB target group has no healthy target.
aws cloudwatch put-metric-alarm \
  --alarm-name "P0-prod-alb-targetgroup-api-HealthyHostCount" \
  --alarm-description "No healthy target in prod api target group" \
  --namespace AWS/ApplicationELB \
  --metric-name HealthyHostCount \
  --dimensions \
    Name=TargetGroup,Value=targetgroup/prod-api/def456 \
    Name=LoadBalancer,Value=app/prod-alb/abc123 \
  --statistic Minimum \
  --period 60 \
  --evaluation-periods 2 \
  --datapoints-to-alarm 2 \
  --threshold 1 \
  --comparison-operator LessThanThreshold \
  --treat-missing-data breaching \
  --alarm-actions arn:aws:sns:ap-east-1:111122223333:prod-critical-alerts \
  --ok-actions arn:aws:sns:ap-east-1:111122223333:prod-critical-alerts \
  --tags \
    Key=env,Value=prod \
    Key=service,Value=api \
    Key=team,Value=platform \
    Key=severity,Value=P0 \
    Key=resource_type,Value=targetgroup \
  --region ap-east-1

Verify:
aws cloudwatch describe-alarms \
  --alarm-names "P0-prod-alb-targetgroup-api-HealthyHostCount" \
  --region ap-east-1

6. Metric Math Alarm#
Example: ALB target 5xx rate.
aws cloudwatch put-metric-alarm \
  --alarm-name "P1-prod-alb-targetgroup-api-Target5xxRate" \
  --alarm-description "Target 5xx rate >= 5% for 5 minutes" \
  --evaluation-periods 1 \
  --datapoints-to-alarm 1 \
  --threshold 5 \
  --comparison-operator GreaterThanOrEqualToThreshold \
  --treat-missing-data notBreaching \
  --metrics '[
    {
      "Id": "req",
      "MetricStat": {
        "Metric": {
          "Namespace": "AWS/ApplicationELB",
          "MetricName": "RequestCount",
          "Dimensions": [
            {"Name": "LoadBalancer", "Value": "app/prod-alb/abc123"},
            {"Name": "TargetGroup", "Value": "targetgroup/prod-api/def456"}
          ]
        },
        "Period": 300,
        "Stat": "Sum"
      },
      "ReturnData": false
    },
    {
      "Id": "e5xx",
      "MetricStat": {
        "Metric": {
          "Namespace": "AWS/ApplicationELB",
          "MetricName": "HTTPCode_Target_5XX_Count",
          "Dimensions": [
            {"Name": "LoadBalancer", "Value": "app/prod-alb/abc123"},
            {"Name": "TargetGroup", "Value": "targetgroup/prod-api/def456"}
          ]
        },
        "Period": 300,
        "Stat": "Sum"
      },
      "ReturnData": false
    },
    {
      "Id": "target_5xx_rate",
      "Expression": "IF(req>=100,100*e5xx/req,0)",
      "Label": "Target 5xx percent",
      "ReturnData": true
    }
  ]' \
  --alarm-actions arn:aws:sns:ap-east-1:111122223333:prod-critical-alerts \
  --ok-actions arn:aws:sns:ap-east-1:111122223333:prod-critical-alerts \
  --region ap-east-1

7. Missing Data#
Set treat-missing-data explicitly on every alarm instead of relying on the default value, missing.
| Use Case | treat-missing-data |
|---|---|
| health count / exporter alive | breaching |
| request error rate | notBreaching |
| sparse event metric | notBreaching |
| heartbeat metric | breaching |
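
To make the table concrete, the sketch below models how the two settings used in this document change the outcome when datapoints are missing. It is a deliberately simplified illustration, not CloudWatch's actual evaluation algorithm: the function name `evaluateAlarm` and its parameters are hypothetical, `null` stands for a period with no datapoint, and only `breaching` and `notBreaching` are modeled.

```javascript
// Simplified model of "M out of N" alarm evaluation with missing data.
// Assumption: this is NOT the real CloudWatch algorithm, only an illustration.
// datapoints: one entry per evaluation period, null when the period has no data.
function evaluateAlarm({ datapoints, threshold, isBreaching, datapointsToAlarm, treatMissingData }) {
  let breachingCount = 0;
  for (const value of datapoints) {
    if (value === null) {
      // "breaching" counts a missing period as bad; "notBreaching" counts it as good.
      if (treatMissingData === "breaching") breachingCount += 1;
    } else if (isBreaching(value, threshold)) {
      breachingCount += 1;
    }
  }
  return breachingCount >= datapointsToAlarm ? "ALARM" : "OK";
}

const lessThan = (value, threshold) => value < threshold;
const greaterOrEqual = (value, threshold) => value >= threshold;

// HealthyHostCount stops reporting for two periods: missing data should alarm.
console.log(evaluateAlarm({
  datapoints: [null, null],
  threshold: 1,
  isBreaching: lessThan,
  datapointsToAlarm: 2,
  treatMissingData: "breaching"
})); // ALARM

// Target 5xx rate has no datapoint during a quiet period: missing data should not page.
console.log(evaluateAlarm({
  datapoints: [null],
  threshold: 5,
  isBreaching: greaterOrEqual,
  datapointsToAlarm: 1,
  treatMissingData: "notBreaching"
})); // OK
```

Swapping the two settings in these calls flips both results, which is why the choice should be made per metric rather than left at the default.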
Examples:
A missing HealthyHostCount can indicate a metric or resource problem, so treat it as breaching.
A missing Target5xxRate during a period with no traffic should not page, so treat it as notBreaching.

8. Composite Alarm#
Use composite alarms to reduce noise when a high-level symptom already covers lower-level symptoms.
Example:
ServiceUnavailable = ALBNoHealthyTarget OR ECSRunningTaskBelowDesired
Notify P0 only from the composite alarm.
Keep the component alarms for diagnosis.

Composite alarm example:
aws cloudwatch put-composite-alarm \
  --alarm-name "P0-prod-api-service-ServiceUnavailable" \
  --alarm-rule 'ALARM("P0-prod-alb-targetgroup-api-HealthyHostCount") OR ALARM("P1-prod-ecs-service-api-TaskBelowDesired")' \
  --alarm-actions arn:aws:sns:ap-east-1:111122223333:prod-critical-alerts \
  --region ap-east-1

9. Production Checklist#
naming:
severity first
env/service/resource_type/resource/signal included
actions:
alarm-actions configured
ok-actions configured for P0/P1
insufficient-data-actions configured where useful
metadata:
tags include severity/env/service/team
runbook and dashboard available
routing:
prod critical topic separate from warning topic
Lambda router normalizes payload
final notification includes account and region
quality:
missing-data behavior reviewed
metric math handles low traffic
alarms tested with set-alarm-state in non-prod
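
The naming items in this checklist can be checked mechanically. The following is a minimal sketch, assuming that no segment contains a hyphen and that severities and environments are limited to the values listed in the router payload (P0 to P3; prod, uat, dev). The function name `parseAlarmName` and the constants are hypothetical.

```javascript
// Segment names follow the recommended format:
// <severity>-<env>-<service>-<resource_type>-<resource>-<signal>
const FIELDS = ["severity", "env", "service", "resource_type", "resource", "signal"];
// Assumption: allowed values are the ones listed in the router payload section.
const SEVERITIES = ["P0", "P1", "P2", "P3"];
const ENVS = ["prod", "uat", "dev"];

// Returns an object keyed by segment name, or null when the name does not
// match the format. Assumption: segments themselves never contain "-".
function parseAlarmName(name) {
  const parts = name.split("-");
  if (parts.length !== FIELDS.length || parts.some((part) => part === "")) {
    return null;
  }
  const parsed = Object.fromEntries(FIELDS.map((field, i) => [field, parts[i]]));
  if (!SEVERITIES.includes(parsed.severity)) return null;
  if (!ENVS.includes(parsed.env)) return null;
  return parsed;
}

console.log(parseAlarmName("P0-prod-alb-targetgroup-api-HealthyHostCount"));
console.log(parseAlarmName("prod-alb-HealthyHostCount")); // null
```

Under this assumption, any name without exactly six segments returns null, including the five-segment composite alarm name used in section 8, so a real check would need an agreed rule for composite alarms and for resources with hyphenated names.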