Alarm Statistics

Links

https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/Statistics-definitions.html
https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/cloudwatch_concepts.html#Percentiles
https://docs.aws.amazon.com/cli/latest/reference/cloudwatch/put-metric-alarm.html

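1. Important Points

A CloudWatch alarm's statistic aggregates the metric data points that fall within one period into a single value, and the alarm compares that value against the threshold.

When a period contains multiple samples:
    the statistic decides how those samples are aggregated

Alarm evaluation: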
    aggregated value
      compared with threshold
      across evaluation periods
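
The evaluation step can be sketched locally. This is a simplified model, not CloudWatch's actual algorithm: it assumes one already-aggregated value per period, a GreaterThanThreshold comparison, and no missing data. The helper name `alarm_breaches` is hypothetical and not part of the AWS CLI.

```shell
# alarm_breaches THRESHOLD EVAL_PERIODS DATAPOINTS_TO_ALARM values...
# Values are one aggregated value per period, oldest first.
# Hypothetical helper; missing data handling is deliberately left out.
alarm_breaches() {
  threshold=$1; eval_periods=$2; datapoints_to_alarm=$3
  shift 3
  # Keep only the most recent EVAL_PERIODS values, one per line.
  printf '%s\n' "$@" | tail -n "$eval_periods" |
    awk -v t="$threshold" -v m="$datapoints_to_alarm" '
      $1 > t { breaching++ }    # GreaterThanThreshold comparison
      END { print ((breaching >= m) ? "ALARM" : "OK") }'
}

# Five 5-minute Average CPU values, threshold 80, 2 out of the last 3 periods.
alarm_breaches 80 3 2 40 85 70 90 82   # prints ALARM (90 and 82 breach)
alarm_breaches 80 3 2 85 90 70 60 82   # prints OK (only 82 breaches)
```

The three leading arguments mirror `--threshold`, `--evaluation-periods`, and `--datapoints-to-alarm` from the CLI examples in section 5.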

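A standard (single-metric) alarm takes its statistic from one of two options:

Option	Purpose	Examples	Note
`--statistic`	Basic statistics	`Average`, `Sum`, `Maximum`	Cannot be combined with `--extended-statistic`
`--extended-statistic`	Percentiles, trimmed mean, winsorized mean, and similar	`p90`, `tm90`, `TS(:90%)`	Cannot be combined with `--statistic`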

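2. Basic Statistics

These values are passed with `--statistic`.

Statistic	What it represents	Good for monitoring	Example
`SampleCount`	Number of samples in the period	Request volume, sample counts, whether there is enough data	120 latency samples received in 1 minute gives `SampleCount = 120`
`Sum`	Total of all sample values in the period	Counter/count metrics such as request count, error count, bytes, message count	`HTTPCode_Target_5XX_Count` with `Sum` gives the total number of 5xx responses in 1 minute
`Average`	`Sum / SampleCount`	Smoothed trends such as CPU, memory, connection utilization	`CPUUtilization` with `Average` gives the 5-minute average CPU
`Minimum`	Smallest value in the period	Healthy instance count, lower bound of available capacity	`HealthyHostCount` with `Minimum` flags any dip below the threshold within the period
`Maximum`	Largest value in the period	Spikes, queue age, disk watermark, peak connection count	`ApproximateAgeOfOldestMessage` with `Maximum` shows whether the oldest message age spiked

A quick illustration: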

samples in one period:
    10, 20, 30, 40

SampleCount = 4
Sum         = 100
Average     = 25
Minimum     = 10
Maximum     = 40
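
The same aggregation can be reproduced locally to sanity-check a threshold before creating an alarm. A minimal sketch, assuming the raw samples of one period are passed as arguments; `basic_stats` is a hypothetical helper name, not an AWS command.

```shell
# basic_stats samples... -> the five basic statistics for one period
# Hypothetical helper for local experiments.
basic_stats() {
  printf '%s\n' "$@" | awk '
    NR == 1 { min = $1; max = $1 }
    { sum += $1; if ($1 < min) min = $1; if ($1 > max) max = $1 }
    END {
      printf "SampleCount=%d Sum=%g Average=%g Minimum=%g Maximum=%g\n",
             NR, sum, sum / NR, min, max
    }'
}

basic_stats 10 20 30 40
# SampleCount=4 Sum=100 Average=25 Minimum=10 Maximum=40
```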

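3. Extended Statistics

These values are passed with `--extended-statistic`.

Extended statistic	What it represents	Good for monitoring	Example
`pNN`	Percentile; `p90` is the value that about 90% of samples are at or below	Latency, duration, CPU distribution, tail experience	`p90`, `p95`, `p99`, `p99.9`
`tmNN`	Trimmed mean: the average after dropping samples above the NN percentile	Typical latency with extreme high outliers removed	`tm90` averages only the lowest 90% of samples
`TM(:NN%)`	Trimmed mean range: keeps samples from the low end up to the NN percentile	Same as `tmNN`, in the more explicit range syntax	`TM(:90%)`
`TM(N%:M%)`	Trimmed mean range: keeps samples between the N and M percentiles	Dropping both extreme low and extreme high values	`TM(10%:90%)`
`TM(low:high)`	Trimmed mean over a fixed value range: keeps samples inside the given values	Analyzing only values the business considers valid	`TM(0.05:2)` uses only samples greater than 0.05 s and at most 2 s
`IQM`	Interquartile mean: the trimmed mean of the middle 50% of samples	A typical-experience figure that is highly resistant to outliers	Equivalent to `TM(25%:75%)`
`wmNN`	Winsorized mean: values above the NN percentile are clamped to the boundary value before averaging	Keeping some outlier influence without letting extremes dominate the average	`wm90`
`WM(N%:M%)`	Winsorized mean range: values outside the range are clamped to the boundary values before averaging	More conservative outlier handling than TM, since no samples are discarded	`WM(10%:90%)`
`PR(:high)`	Percentile rank: the percentage of samples less than or equal to `high`	SLA/SLO-style checks: how many requests finish under a given latency	`PR(:0.3)` is the percentage of requests taking `<= 0.3s`
`PR(low:high)`	Percentile rank: the percentage of samples inside a fixed value range	Building histogram buckets by hand	`PR(0.1:0.5)`
`tcNN`	Trimmed count: the number of samples inside the trimmed range	Checking, alongside `tmNN`, that the sample size is large enough	`tc90`
`TC(:NN%)`	Trimmed count range	Counting the samples inside a percentile range	`TC(:90%)`
`tsNN`	Trimmed sum: the sum of sample values inside the trimmed range	Totals with outliers removed	`ts90`
`TS(:NN%)`	Trimmed sum range	Analyzing trimmed data together with `TC`/`TM`	`TS(:90%)`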

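Syntax and boundary rules:

p95.5:
    percentiles may include a decimal part

tm99:
    lowercase shortcut, accepts a single percentile only
    equivalent to TM(:99%)

TM(10%:90%):
    uppercase range syntax, filters by percentile boundaries

TM(0.05:2):
    uppercase range syntax, filters by fixed metric-value boundaries

fixed value range:
    applies to TM / TC / TS / WM / PR
    lower bound is exclusive
    upper bound is inclusive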
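
The exclusive-lower / inclusive-upper rule is easy to get wrong by one sample, so here is a small local sketch of fixed value ranges. The helpers `pr_range` and `tm_range` are hypothetical and only imitate `PR(low:high)` and `TM(low:high)` on raw samples. Reporting the percentile rank on a 0 to 100 scale is an assumption that matches the threshold of 95 used in the PR example in section 5.

```shell
# pr_range LOW HIGH samples... -> percentage of samples with LOW < x <= HIGH
pr_range() {
  low=$1; high=$2; shift 2
  printf '%s\n' "$@" | awk -v lo="$low" -v hi="$high" '
    $1 > lo && $1 <= hi { inside++ }
    END { printf "%.1f\n", 100 * inside / NR }'
}

# tm_range LOW HIGH samples... -> mean of samples with LOW < x <= HIGH
tm_range() {
  low=$1; high=$2; shift 2
  printf '%s\n' "$@" | awk -v lo="$low" -v hi="$high" '
    $1 > lo && $1 <= hi { sum += $1; n++ }
    END { if (n) printf "%.3f\n", sum / n; else print "no samples in range" }'
}

# Latency samples in seconds. 0.05 and 0.1 sit exactly on a lower bound
# (excluded); 0.5 and 2 sit exactly on an upper bound (included).
pr_range 0.1 0.5  0.05 0.1 0.2 0.5 2 3    # 0.2 and 0.5 are inside -> 33.3
tm_range 0.05 2   0.05 0.1 0.2 0.5 2 3    # 0.1, 0.2, 0.5, 2 are inside -> 0.700
```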

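4. How To Choose

Scenario	Recommended statistic	Reason
CPU / memory / utilization	`Average`	Resource utilization is usually about sustained average pressure
RequestCount / 4xx / 5xx / processed messages	`Sum`	Only the per-period total is meaningful for these metrics
Latency as user experience	`p90`, `p95`, `p99`	Average hides slow requests, while Maximum is easily inflated by a single outlier
Latency that must resist outliers	`tm90`, `tm99`, `IQM`	More representative of typical requests than Average, more stable than Maximum
SLA: 95% of requests must complete within 300 ms	`PR(:0.3)`	Directly yields the percentage of samples that meet the target (0.3 assumes the metric is reported in seconds)
Queue age / oldest message age	`Maximum`	Even a brief spike in the oldest message's age can indicate a consumer backlog
Healthy host / available capacity	`Minimum`	Dropping below the floor matters more than the average
Checking sample size alongside a trimmed mean	`tc90` / `TC(...)`	An alarm on a trimmed mean is less trustworthy when it is computed from too few samples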
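
To see why the latency rows prefer percentiles and trimmed means, the sketch below compares four statistics on one made-up period of samples. The helpers `percentile` and `trimmed_mean` are hypothetical and use a simple nearest-rank rule, which is an assumption: CloudWatch's own percentile and trimming arithmetic may give slightly different numbers.

```shell
# percentile P samples... -> value at rank ceil(P% of N) in sorted order
percentile() {
  p=$1; shift
  printf '%s\n' "$@" | sort -n | awk -v p="$p" '
    { v[NR] = $1 }
    END {
      x = p * NR / 100
      rank = int(x); if (rank < x) rank++    # ceil
      if (rank < 1) rank = 1
      print v[rank]
    }'
}

# trimmed_mean P samples... -> mean of the lowest floor(P% of N) samples
trimmed_mean() {
  p=$1; shift
  printf '%s\n' "$@" | sort -n | awk -v p="$p" '
    { v[NR] = $1 }
    END {
      keep = int(p * NR / 100); if (keep < 1) keep = 1
      for (i = 1; i <= keep; i++) sum += v[i]
      printf "%.1f\n", sum / keep
    }'
}

# 20 latency samples in ms: 17 fast, 2 slow, 1 extreme outlier.
samples="100 100 100 100 100 100 100 100 100 100
100 100 100 100 100 100 100 900 900 5000"

# $samples is left unquoted on purpose so each value becomes one argument.
printf '%s\n' $samples | awk '{ s += $1 } END { print "Average =", s / NR }'  # Average = 425
echo "p90     = $(percentile 90 $samples)"      # p90     = 900
echo "tm90    = $(trimmed_mean 90 $samples)"    # tm90    = 144.4
echo "Maximum = $(percentile 100 $samples)"     # Maximum = 5000
```

Here Average (425) matches no real request, Maximum (5000) reflects a single outlier, p90 (900) shows what the slow tail looks like, and tm90 (144.4) stays close to the typical request.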

5. CLI Examples

Average CPU

# Alarm when the 5-minute average CPU of an EC2 instance exceeds 80% in at least 2 of the last 3 periods
aws cloudwatch put-metric-alarm \
  --alarm-name P2-prod-ec2-api-CPUUtilizationHigh \
  --namespace AWS/EC2 \
  --metric-name CPUUtilization \
  --dimensions Name=InstanceId,Value=i-0123456789abcdef0 \
  --statistic Average \
  --period 300 \
  --evaluation-periods 3 \
  --datapoints-to-alarm 2 \
  --threshold 80 \
  --comparison-operator GreaterThanThreshold \
  --treat-missing-data missing \
  --region ap-east-1

Sum ALB 5xx

# Alarm when the ALB target group returns more than 20 target 5xx responses per 1-minute period, in at least 3 of the last 5 periods
aws cloudwatch put-metric-alarm \
  --alarm-name P1-prod-alb-api-Target5xxHigh \
  --namespace AWS/ApplicationELB \
  --metric-name HTTPCode_Target_5XX_Count \
  --dimensions Name=LoadBalancer,Value=app/prod-api/0123456789abcdef Name=TargetGroup,Value=targetgroup/prod-api/0123456789abcdef \
  --statistic Sum \
  --period 60 \
  --evaluation-periods 5 \
  --datapoints-to-alarm 3 \
  --threshold 20 \
  --comparison-operator GreaterThanThreshold \
  --treat-missing-data notBreaching \
  --region ap-east-1

Minimum Healthy Host Count

# Check whether the ALB target group ever dropped below 2 healthy targets within each period
aws cloudwatch put-metric-alarm \
  --alarm-name P0-prod-alb-api-HealthyHostCountLow \
  --namespace AWS/ApplicationELB \
  --metric-name HealthyHostCount \
  --dimensions Name=LoadBalancer,Value=app/prod-api/0123456789abcdef Name=TargetGroup,Value=targetgroup/prod-api/0123456789abcdef \
  --statistic Minimum \
  --period 60 \
  --evaluation-periods 3 \
  --datapoints-to-alarm 2 \
  --threshold 2 \
  --comparison-operator LessThanThreshold \
  --treat-missing-data breaching \
  --region ap-east-1

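p90 Latency

# Check whether the p90 target response time stays above 1 second, i.e. whether the slowest 10% of requests take longer than 1 second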
aws cloudwatch put-metric-alarm \
  --alarm-name P1-prod-alb-api-TargetResponseTimeP90High \
  --namespace AWS/ApplicationELB \
  --metric-name TargetResponseTime \
  --dimensions Name=LoadBalancer,Value=app/prod-api/0123456789abcdef Name=TargetGroup,Value=targetgroup/prod-api/0123456789abcdef \
  --extended-statistic p90 \
  --period 60 \
  --evaluation-periods 5 \
  --datapoints-to-alarm 3 \
  --threshold 1 \
  --comparison-operator GreaterThanThreshold \
  --evaluate-low-sample-count-percentile ignore \
  --treat-missing-data notBreaching \
  --region ap-east-1

tm90 Latency

# After dropping the slowest 10% of requests, check whether the average response time of typical requests still exceeds 600 ms
aws cloudwatch put-metric-alarm \
  --alarm-name P2-prod-alb-api-TargetResponseTimeTM90High \
  --namespace AWS/ApplicationELB \
  --metric-name TargetResponseTime \
  --dimensions Name=LoadBalancer,Value=app/prod-api/0123456789abcdef Name=TargetGroup,Value=targetgroup/prod-api/0123456789abcdef \
  --extended-statistic tm90 \
  --period 60 \
  --evaluation-periods 5 \
  --datapoints-to-alarm 3 \
  --threshold 0.6 \
  --comparison-operator GreaterThanThreshold \
  --treat-missing-data notBreaching \
  --region ap-east-1

PR For SLO

# Check whether the percentage of requests completing within 300 ms falls below 95%
aws cloudwatch put-metric-alarm \
  --alarm-name P1-prod-alb-api-TargetResponseTimePR300msLow \
  --namespace AWS/ApplicationELB \
  --metric-name TargetResponseTime \
  --dimensions Name=LoadBalancer,Value=app/prod-api/0123456789abcdef Name=TargetGroup,Value=targetgroup/prod-api/0123456789abcdef \
  --extended-statistic 'PR(:0.3)' \
  --period 60 \
  --evaluation-periods 5 \
  --datapoints-to-alarm 3 \
  --threshold 95 \
  --comparison-operator LessThanThreshold \
  --treat-missing-data notBreaching \
  --region ap-east-1

TC And TS

# After dropping the top 10% of outliers, check whether too few samples remain in the tm90 calculation
aws cloudwatch put-metric-alarm \
  --alarm-name P3-prod-alb-api-TargetResponseTimeTC90Low \
  --namespace AWS/ApplicationELB \
  --metric-name TargetResponseTime \
  --dimensions Name=LoadBalancer,Value=app/prod-api/0123456789abcdef Name=TargetGroup,Value=targetgroup/prod-api/0123456789abcdef \
  --extended-statistic tc90 \
  --period 60 \
  --evaluation-periods 5 \
  --datapoints-to-alarm 3 \
  --threshold 50 \
  --comparison-operator LessThanThreshold \
  --treat-missing-data notBreaching \
  --region ap-east-1

# After dropping the top 10% of outliers, check whether the trimmed latency sum rises abnormally
aws cloudwatch put-metric-alarm \
  --alarm-name P3-prod-alb-api-TargetResponseTimeTS90High \
  --namespace AWS/ApplicationELB \
  --metric-name TargetResponseTime \
  --dimensions Name=LoadBalancer,Value=app/prod-api/0123456789abcdef Name=TargetGroup,Value=targetgroup/prod-api/0123456789abcdef \
  --extended-statistic ts90 \
  --period 60 \
  --evaluation-periods 5 \
  --datapoints-to-alarm 3 \
  --threshold 100 \
  --comparison-operator GreaterThanThreshold \
  --treat-missing-data notBreaching \
  --region ap-east-1

6. Production Notes

percentile / trimmed statistics:
    require raw data points or compatible statistic sets
    do not work when metric values contain negative numbers

low traffic services:
    use --evaluate-low-sample-count-percentile ignore for percentile alarms
    otherwise p99 / p90 can flap when sample count is too low

latency alarms:
    use p90/p95/p99 for user-facing tail latency
    use tm90/tm99 when you want stable latency without extreme outliers
    keep tc90/TC(...) on dashboard when sample size matters

counter alarms:
    use Sum, not Average

gauge alarms:
    use Average for sustained pressure
    use Maximum or Minimum when a short spike/drop matters

metric math alarms:
    do not use top-level --statistic / --extended-statistic
    put Stat inside the --metrics MetricStat object
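
A metric math alarm could look like the sketch below. The alarm name, the file name `metrics.json`, the `Id` values, the 5% threshold, and the error-rate expression are all illustrative assumptions; only the placement of `Stat` inside each `MetricStat` object is the point. The `aws` call is wrapped in a function (hypothetical name `create_error_rate_alarm`) so that nothing is created until it is invoked explicitly.

```shell
# Write the metric definitions: two MetricStat inputs plus one expression.
# Each input carries its own Stat; an extended statistic such as "p90"
# would go in the same field.
cat > metrics.json <<'EOF'
[
  {
    "Id": "errors",
    "MetricStat": {
      "Metric": {
        "Namespace": "AWS/ApplicationELB",
        "MetricName": "HTTPCode_Target_5XX_Count",
        "Dimensions": [
          {"Name": "LoadBalancer", "Value": "app/prod-api/0123456789abcdef"},
          {"Name": "TargetGroup", "Value": "targetgroup/prod-api/0123456789abcdef"}
        ]
      },
      "Period": 60,
      "Stat": "Sum"
    },
    "ReturnData": false
  },
  {
    "Id": "requests",
    "MetricStat": {
      "Metric": {
        "Namespace": "AWS/ApplicationELB",
        "MetricName": "RequestCount",
        "Dimensions": [
          {"Name": "LoadBalancer", "Value": "app/prod-api/0123456789abcdef"},
          {"Name": "TargetGroup", "Value": "targetgroup/prod-api/0123456789abcdef"}
        ]
      },
      "Period": 60,
      "Stat": "Sum"
    },
    "ReturnData": false
  },
  {
    "Id": "error_rate",
    "Expression": "100 * errors / requests",
    "Label": "Target 5xx rate (%)",
    "ReturnData": true
  }
]
EOF

# No top-level --namespace / --metric-name / --period / --statistic here:
# everything about the metrics lives in metrics.json.
create_error_rate_alarm() {
  aws cloudwatch put-metric-alarm \
    --alarm-name P1-prod-alb-api-Target5xxRateHigh \
    --metrics file://metrics.json \
    --evaluation-periods 5 \
    --datapoints-to-alarm 3 \
    --threshold 5 \
    --comparison-operator GreaterThanThreshold \
    --treat-missing-data notBreaching \
    --region ap-east-1
}
```

Only the expression has `ReturnData` set to true, because the alarm watches a single returned time series. Run `create_error_rate_alarm` with valid credentials to create the alarm.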