CloudWatch Alarms


https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/AlarmThatSendsEmail.html
https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/Create-alarm-on-metric-math-expression.html
https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/AlarmThatSendsEmail.html#alarm-evaluation
https://docs.aws.amazon.com/cli/latest/reference/cloudwatch/put-metric-alarm.html
https://docs.aws.amazon.com/sns/latest/dg/sns-lambda.html
https://docs.aws.amazon.com/lambda/latest/dg/with-sns.html

1. Important Points

CloudWatch Alarm is the AWS-native alerting entry point. It suits AWS managed services, CloudWatch metric math, custom metrics, and basic fallback alerts.

Use CloudWatch Alarm for:
    AWS service metric alert
    metric math alert
    anomaly detection alert
    composite alarm
    SNS / Lambda / incident tool integration

CloudWatch Alarm is not a good fit for:
    high-cardinality label alerting
    complex PromQL-style aggregation
    large dynamic target discovery
    long historical trend query

Recommended architecture:

CloudWatch Alarm
  -> SNS Topic by env/severity
  -> Lambda notification router
  -> Slack / Teams / PagerDuty / Opsgenie / webhook

2. Naming Standard

Recommended format:

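<severity>-<env>-<service>-<resource_type>-<resource>-<signal>

In the examples, <service> is the AWS service (alb, ecs, aurora, sqs) and <resource> is the workload or resource name (api, main, worker). The owning application goes in the service tag.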

Examples:

P0-prod-alb-targetgroup-api-HealthyHostCount
P1-prod-alb-targetgroup-api-Target5xxRate
P1-prod-ecs-service-api-TaskBelowDesired
P2-prod-ecs-service-api-CPUUtilizationHigh
P1-prod-aurora-cluster-main-CPUUtilization
P2-uat-sqs-queue-worker-OldestMessageAge

CloudWatch tags:

env=prod
service=api
team=platform
severity=P1
resource_type=targetgroup
runbook_url=https://docs.example.com/runbooks/alb-target-5xx
dashboard_url=https://console.aws.amazon.com/cloudwatch/...

3. SNS Topics

Create topics by environment and severity class:

aws sns create-topic \
  --name prod-critical-alerts \
  --region ap-east-1

aws sns create-topic \
  --name prod-warning-alerts \
  --region ap-east-1

aws sns create-topic \
  --name uat-warning-alerts \
  --region ap-east-1

Subscribe the Lambda notification router to the topic:

aws sns subscribe \
  --topic-arn arn:aws:sns:ap-east-1:111122223333:prod-critical-alerts \
  --protocol lambda \
  --notification-endpoint arn:aws:lambda:ap-east-1:111122223333:function:cloudwatch-alarm-router \
  --region ap-east-1

Allow SNS to invoke Lambda:

aws lambda add-permission \
  --function-name cloudwatch-alarm-router \
  --statement-id sns-prod-critical-alerts \
  --action lambda:InvokeFunction \
  --principal sns.amazonaws.com \
  --source-arn arn:aws:sns:ap-east-1:111122223333:prod-critical-alerts \
  --region ap-east-1
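
When alarms are created from scripts, the topic can be derived from env and severity instead of being hard-coded. This is a sketch under stated assumptions: `topicArnFor` is a hypothetical helper, and the mapping of P0/P1 to the critical topic and P2/P3 to the warning topic is an assumption inferred from the topic names above.

```javascript
// Hypothetical helper: choose the SNS topic ARN for an alarm.
// Assumption: P0/P1 -> "<env>-critical-alerts", P2/P3 -> "<env>-warning-alerts".
function topicArnFor({ env, severity, region, accountId }) {
  const severityClass = severity === "P0" || severity === "P1" ? "critical" : "warning";
  return `arn:aws:sns:${region}:${accountId}:${env}-${severityClass}-alerts`;
}

console.log(
  topicArnFor({ env: "prod", severity: "P1", region: "ap-east-1", accountId: "111122223333" })
);
```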

4. Lambda Router Payload

Normalize the CloudWatch alarm SNS message before sending it to the final channel. Target fields:

aws_alarm_name: alarm name
aws_account_id: account id
aws_region: region
aws_namespace: AWS/ApplicationELB
aws_resource: resource id / arn / dimension value
env: prod / uat / dev
service: api / worker / payment
resource_type: targetgroup / ecs-service / queue / cluster
severity: P0 / P1 / P2 / P3
source: cloudwatch
alarm_state: ALARM / OK / INSUFFICIENT_DATA
previous_state: OK / ALARM / INSUFFICIENT_DATA
reason: CloudWatch alarm reason
alarm_time: state change time
dashboard_url: dashboard link
runbook_url: runbook link

Minimal Lambda parser shape:

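export const handler = async (event) => {
  for (const record of event.Records) {
    // SNS delivers the CloudWatch alarm notification as a JSON string.
    const message = JSON.parse(record.Sns.Message);
    const alarmName = message.AlarmName;
    const state = message.NewStateValue;
    const previousState = message.OldStateValue;

    // message.Region is a display name such as "Asia Pacific (Hong Kong)",
    // so read the region code from an ARN: arn:aws:<service>:<region>:<account>:...
    const arn = message.AlarmArn ?? record.Sns.TopicArn;
    const regionCode = arn.split(":")[3];

    const normalized = {
      source: "cloudwatch",
      aws_alarm_name: alarmName,
      aws_account_id: message.AWSAccountId,
      aws_region: regionCode,
      alarm_state: state,
      previous_state: previousState,
      reason: message.NewStateReason,
      alarm_time: message.StateChangeTime,
      trigger: message.Trigger
    };

    // Log only; delivery to the final channel is not part of this shape.
    console.log(JSON.stringify(normalized));
  }
};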
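
Once the payload is normalized, the router still has to pick a channel and build a message. The sketch below shows one possible shape. The `ROUTES` table and `buildNotification` are hypothetical, the channel names are placeholders, and it assumes a `severity` field has already been filled in (for example from the alarm name or tags).

```javascript
// Hypothetical routing table: which channel receives which severity.
const ROUTES = { P0: "pagerduty", P1: "pagerduty", P2: "slack", P3: "slack" };

// Build a channel-agnostic notification from a normalized payload.
function buildNotification(normalized) {
  // Unknown or missing severity falls back to the low-urgency channel.
  const channel = ROUTES[normalized.severity] ?? "slack";
  const text = [
    `[${normalized.alarm_state}] ${normalized.aws_alarm_name}`,
    // Account and region are always included so the alert is actionable.
    `account=${normalized.aws_account_id} region=${normalized.aws_region}`,
    `previous=${normalized.previous_state}`,
    `reason=${normalized.reason}`,
  ].join("\n");
  return { channel, text };
}

console.log(
  buildNotification({
    severity: "P0",
    aws_alarm_name: "P0-prod-alb-targetgroup-api-HealthyHostCount",
    aws_account_id: "111122223333",
    aws_region: "ap-east-1",
    alarm_state: "ALARM",
    previous_state: "OK",
    reason: "example reason text",
  })
);
```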

5. Basic Alarm

Example: ALB target group has no healthy target.

aws cloudwatch put-metric-alarm \
  --alarm-name "P0-prod-alb-targetgroup-api-HealthyHostCount" \
  --alarm-description "No healthy target in prod api target group" \
  --namespace AWS/ApplicationELB \
  --metric-name HealthyHostCount \
  --dimensions \
      Name=TargetGroup,Value=targetgroup/prod-api/def456 \
      Name=LoadBalancer,Value=app/prod-alb/abc123 \
  --statistic Minimum \
  --period 60 \
  --evaluation-periods 2 \
  --datapoints-to-alarm 2 \
  --threshold 1 \
  --comparison-operator LessThanThreshold \
  --treat-missing-data breaching \
  --alarm-actions arn:aws:sns:ap-east-1:111122223333:prod-critical-alerts \
  --ok-actions arn:aws:sns:ap-east-1:111122223333:prod-critical-alerts \
  --tags \
      Key=env,Value=prod \
      Key=service,Value=api \
      Key=team,Value=platform \
      Key=severity,Value=P0 \
      Key=resource_type,Value=targetgroup \
  --region ap-east-1

Verify:

aws cloudwatch describe-alarms \
  --alarm-names "P0-prod-alb-targetgroup-api-HealthyHostCount" \
  --region ap-east-1

6. Metric Math Alarm

Example: ALB target 5xx rate.

aws cloudwatch put-metric-alarm \
  --alarm-name "P1-prod-alb-targetgroup-api-Target5xxRate" \
  --alarm-description "Target 5xx rate >= 5% for 5 minutes" \
  --evaluation-periods 1 \
  --datapoints-to-alarm 1 \
  --threshold 5 \
  --comparison-operator GreaterThanOrEqualToThreshold \
  --treat-missing-data notBreaching \
  --metrics '[
    {
      "Id": "req",
      "MetricStat": {
        "Metric": {
          "Namespace": "AWS/ApplicationELB",
          "MetricName": "RequestCount",
          "Dimensions": [
            {"Name": "LoadBalancer", "Value": "app/prod-alb/abc123"},
            {"Name": "TargetGroup", "Value": "targetgroup/prod-api/def456"}
          ]
        },
        "Period": 300,
        "Stat": "Sum"
      },
      "ReturnData": false
    },
    {
      "Id": "e5xx",
      "MetricStat": {
        "Metric": {
          "Namespace": "AWS/ApplicationELB",
          "MetricName": "HTTPCode_Target_5XX_Count",
          "Dimensions": [
            {"Name": "LoadBalancer", "Value": "app/prod-alb/abc123"},
            {"Name": "TargetGroup", "Value": "targetgroup/prod-api/def456"}
          ]
        },
        "Period": 300,
        "Stat": "Sum"
      },
      "ReturnData": false
    },
    {
      "Id": "target_5xx_rate",
      "Expression": "IF(req>=100,100*e5xx/req,0)",
      "Label": "Target 5xx percent",
      "ReturnData": true
    }
  ]' \
  --alarm-actions arn:aws:sns:ap-east-1:111122223333:prod-critical-alerts \
  --ok-actions arn:aws:sns:ap-east-1:111122223333:prod-critical-alerts \
  --region ap-east-1
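
To see what the expression IF(req>=100,100*e5xx/req,0) does with low traffic, the same arithmetic can be written out in plain JavaScript. This is only an illustration of the formula. `target5xxRate` is a hypothetical name, and the sample request counts are made up.

```javascript
// Mirrors the alarm expression IF(req>=100, 100*e5xx/req, 0).
// Below minRequests the rate is forced to 0 so a handful of errors
// during quiet periods cannot cross the 5% threshold.
function target5xxRate(req, e5xx, minRequests = 100) {
  return req >= minRequests ? (100 * e5xx) / req : 0;
}

console.log(target5xxRate(2000, 120)); // 6  -> breaches threshold 5
console.log(target5xxRate(40, 20));    // 0  -> 50% raw rate, but under 100 requests
```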

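7. Missing Data

Set missing-data behavior explicitly on every alarm. If --treat-missing-data is omitted, CloudWatch defaults to missing, which can leave an alarm in INSUFFICIENT_DATA instead of paging.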

Use case                         treat-missing-data
health count / exporter alive    breaching
request error rate               notBreaching
sparse event metric              notBreaching
heartbeat metric                 breaching

Example:
    HealthyHostCount missing can indicate metric/resource issue
    Target5xxRate missing during no traffic should not page
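
The effect of the two settings can be illustrated with a deliberately simplified model of one evaluation window. This is a sketch, not CloudWatch's actual evaluation algorithm: `evaluateWindow` is a hypothetical function, it handles only breaching and notBreaching, and it represents a missing datapoint as null.

```javascript
// Simplified model: count breaching datapoints in one evaluation window.
// A null datapoint is "missing" and is counted according to treatMissingData.
function evaluateWindow(datapoints, { isBreaching, datapointsToAlarm, treatMissingData }) {
  if (treatMissingData !== "breaching" && treatMissingData !== "notBreaching") {
    throw new Error(`unsupported treatMissingData in this sketch: ${treatMissingData}`);
  }
  const breachingCount = datapoints.filter((value) =>
    value === null ? treatMissingData === "breaching" : isBreaching(value)
  ).length;
  return breachingCount >= datapointsToAlarm ? "ALARM" : "OK";
}

// HealthyHostCount < 1, 2 of 2 datapoints, both datapoints missing.
const healthyHost = { isBreaching: (v) => v < 1, datapointsToAlarm: 2 };
console.log(evaluateWindow([null, null], { ...healthyHost, treatMissingData: "breaching" }));    // ALARM
console.log(evaluateWindow([null, null], { ...healthyHost, treatMissingData: "notBreaching" })); // OK
```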

8. Composite Alarm

Use composite alarms to reduce noise when a high-level symptom already covers lower-level symptoms.

example:
    ServiceUnavailable = ALBNoHealthyTarget OR ECSRunningTaskBelowDesired
    notify P0 only from composite alarm
    keep component alarms for diagnosis

Composite alarm example:

aws cloudwatch put-composite-alarm \
  --alarm-name "P0-prod-api-service-ServiceUnavailable" \
  --alarm-rule 'ALARM("P0-prod-alb-targetgroup-api-HealthyHostCount") OR ALARM("P1-prod-ecs-service-api-TaskBelowDesired")' \
  --alarm-actions arn:aws:sns:ap-east-1:111122223333:prod-critical-alerts \
  --ok-actions arn:aws:sns:ap-east-1:111122223333:prod-critical-alerts \
  --region ap-east-1
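
The rule above can be mirrored in a few lines to reason about when the composite fires. This is a sketch only: `serviceUnavailable` is a hypothetical function and the `states` map is an assumed input shape (alarm name to state), not something CloudWatch provides in this form.

```javascript
// Mirrors: ALARM(HealthyHostCount alarm) OR ALARM(TaskBelowDesired alarm).
// states maps alarm name -> "ALARM" | "OK" | "INSUFFICIENT_DATA".
function serviceUnavailable(states) {
  const inAlarm = (name) => states[name] === "ALARM";
  return (
    inAlarm("P0-prod-alb-targetgroup-api-HealthyHostCount") ||
    inAlarm("P1-prod-ecs-service-api-TaskBelowDesired")
  );
}

console.log(
  serviceUnavailable({
    "P0-prod-alb-targetgroup-api-HealthyHostCount": "OK",
    "P1-prod-ecs-service-api-TaskBelowDesired": "ALARM",
  })
); // true
```

Note that INSUFFICIENT_DATA on a component does not satisfy ALARM(), so a component with missing data only trips the composite if its own missing-data setting moves it to ALARM.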

9. Production Checklist

naming:
    severity first
    env/service/resource_type/resource/signal included

actions:
    alarm-actions configured
    ok-actions configured for P0/P1
    insufficient-data-actions configured where useful

metadata:
    tags include severity/env/service/team
    runbook and dashboard available

routing:
    prod critical topic separate from warning topic
    Lambda router normalizes payload
    final notification includes account and region

quality:
    missing-data behavior reviewed
    metric math handles low traffic
    alarms tested with set-alarm-state in non-prod
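
Parts of this checklist can be checked mechanically before an alarm is created. The sketch below is one possible lint, not a complete policy: `lintAlarm`, `NAME_PATTERN`, and `REQUIRED_TAGS` are hypothetical, and the input is assumed to be shaped like a PutMetricAlarm request (AlarmName, AlarmActions, OKActions, TreatMissingData, Tags).

```javascript
// Assumption: lowercase alphanumeric env/service/resource_type, free-form resource.
const NAME_PATTERN = /^P[0-3]-[a-z0-9]+-[a-z0-9]+-[a-z0-9]+-.+-[A-Za-z0-9]+$/;
const REQUIRED_TAGS = ["severity", "env", "service", "team"];

// Returns a list of checklist violations; an empty list means the alarm passes.
function lintAlarm(alarm) {
  const problems = [];
  if (!NAME_PATTERN.test(alarm.AlarmName)) {
    problems.push("name does not match naming standard");
  }
  if (!alarm.AlarmActions?.length) {
    problems.push("alarm-actions missing");
  }
  const severity = alarm.AlarmName.slice(0, 2);
  if ((severity === "P0" || severity === "P1") && !alarm.OKActions?.length) {
    problems.push("ok-actions missing for P0/P1");
  }
  if (!alarm.TreatMissingData) {
    problems.push("treat-missing-data not explicit");
  }
  const tagKeys = new Set((alarm.Tags ?? []).map((tag) => tag.Key));
  for (const key of REQUIRED_TAGS) {
    if (!tagKeys.has(key)) problems.push(`tag missing: ${key}`);
  }
  return problems;
}

console.log(
  lintAlarm({
    AlarmName: "P1-prod-alb-targetgroup-api-Target5xxRate",
    AlarmActions: ["arn:aws:sns:ap-east-1:111122223333:prod-critical-alerts"],
  })
);
```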