Links#
https://docs.aws.amazon.com/AmazonECS/latest/developerguide/available-metrics.html
https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/using-metric-math.html
https://prometheus.io/docs/prometheus/latest/configuration/alerting_rules/

1. Alert Standard#
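| Severity | Alert | Meaning | Why Monitor | Definition | Duration |
|---|---|---|---|---|---|
| P0 | Service unavailable | ECS running task count is below desired and the blackbox health check is failing | Confirms a fault at both the scheduling layer and from the user's perspective; act immediately | running_task_count < desired_task_count and blackbox /health failed | 3m |
| P1 | Running task too low | Running task count is below the desired task count | Catches failed deployments, task crashes, insufficient capacity, or image startup failures | running_task_count < desired_task_count | 5m |
| P1 | HTTP 5xx high | Application HTTP 5xx ratio is too high | Directly reflects failed user requests and release or dependency problems | 5xx_rate >= 5% | 5m |
| P1 | P95 latency high | Request tail latency is elevated | Catches degraded user experience, slow dependencies, or service saturation | p95 >= 1s or p95 > 2 * avg_over_time(p95[1h]) | 10m |
| P2 | CPU high | ECS service CPU utilization stays high | Gives early warning of insufficient capacity and autoscaling risk | CPUUtilization >= 80% | 15m |
| P2 | Memory high | ECS service memory utilization stays high | Prevents OOM kills, restart loops, and performance jitter | MemoryUtilization >= 85% | 15m |
| P2 | Event loop lag high | Node.js event loop is blocked for long periods | Catches synchronous blocking, CPU-intensive code, or GC pressure | nodejs_eventloop_lag_seconds_p95 >= 0.2s | 10m |
| P2 | Heap pressure | Node.js heap usage is approaching its limit | Gives early warning of memory leaks and latency caused by frequent GC | heap_used / heap_total >= 85% | 10m |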
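To make the Definition and Duration columns concrete, the sketch below evaluates the P95 latency row in plain Python. It assumes one sample per minute, so a 10m duration maps to ten consecutive breaching samples. The function names `p95_breached` and `alert_fires` and all sample values are illustrative and not part of any monitoring tool.

```python
from typing import Sequence


def p95_breached(p95: float, hourly_avg: float) -> bool:
    """P95 latency row: p95 >= 1s or p95 > 2 * avg_over_time(p95[1h])."""
    return p95 >= 1.0 or p95 > 2 * hourly_avg


def alert_fires(breaches: Sequence[bool], duration_points: int) -> bool:
    """Return True only if the most recent `duration_points` samples all breached.

    Assumption: one sample per minute, so a Duration of 10m means 10 points.
    """
    if len(breaches) < duration_points:
        return False
    return all(breaches[-duration_points:])


# Hypothetical data: the hourly average p95 is 0.3s and the current p95 is 0.7s.
# 0.7 is below the 1s absolute threshold but above 2 * 0.3, so it is a breach.
samples = [p95_breached(0.7, 0.3)] * 10
print(alert_fires(samples, 10))

# One healthy sample inside the window is enough to keep the alert from firing.
samples[4] = p95_breached(0.4, 0.3)
print(alert_fires(samples, 10))
```

The same `alert_fires` helper applies to every row in the table; only the breach predicate and the number of points change.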
Unified definition of "significantly elevated":
current_5m > absolute_threshold
and
current_5m > 2 * avg_over_time(metric[1h])
for 10m

2. CloudWatch Metrics#
| Metric | Meaning | Why Monitor | Namespace | Statistic | Period |
|---|---|---|---|---|---|
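| CPUUtilization | CPU utilization of the ECS service/task | Shows whether compute is saturated and whether scale-out needs to trigger | AWS/ECS | Average | 60s |
| MemoryUtilization | Memory utilization of the ECS service/task | Flags the risk of OOM, memory leaks, and task restarts | AWS/ECS | Average | 60s |
| RunningTaskCount | Number of tasks currently in the running state | Shows whether actual service capacity matches the desired level | ECS/ContainerInsights or AWS/ECS if enabled | Average | 60s |
| DesiredTaskCount | Number of tasks the ECS service wants running | Compare with RunningTaskCount to catch scheduling or startup failures | ECS/ContainerInsights or AWS/ECS if enabled | Average | 60s |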
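As a rough sketch of how the rows above translate into CloudWatch GetMetricData input, the Python below builds one `MetricStat` query entry per metric. The helper name `metric_query` and the query ids are hypothetical, `prod-cluster` and `api` are placeholder dimension values, and the namespace chosen for the task-count metrics assumes Container Insights is enabled.

```python
import json


def metric_query(query_id: str, namespace: str, metric_name: str,
                 cluster: str, service: str,
                 stat: str = "Average", period: int = 60) -> dict:
    """Build one GetMetricData query entry (MetricStat form) for an ECS service."""
    return {
        "Id": query_id,
        "MetricStat": {
            "Metric": {
                "Namespace": namespace,
                "MetricName": metric_name,
                "Dimensions": [
                    {"Name": "ClusterName", "Value": cluster},
                    {"Name": "ServiceName", "Value": service},
                ],
            },
            "Period": period,
            "Stat": stat,
        },
        "ReturnData": True,
    }


# (query id, namespace, metric name) taken from the table rows above.
ROWS = [
    ("cpu", "AWS/ECS", "CPUUtilization"),
    ("memory", "AWS/ECS", "MemoryUtilization"),
    ("running", "ECS/ContainerInsights", "RunningTaskCount"),
    ("desired", "ECS/ContainerInsights", "DesiredTaskCount"),
]

queries = [metric_query(qid, ns, name, "prod-cluster", "api")
           for qid, ns, name in ROWS]
print(json.dumps(queries, indent=2))
```

The printed JSON can be saved to a file and passed to `aws cloudwatch get-metric-data` through `--metric-data-queries file://...`.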
Query to check whether running tasks are below desired tasks:
[
  {
    "Id": "running",
    "MetricStat": {
      "Metric": {
        "Namespace": "ECS/ContainerInsights",
        "MetricName": "RunningTaskCount",
        "Dimensions": [
          { "Name": "ClusterName", "Value": "prod-cluster" },
          { "Name": "ServiceName", "Value": "api" }
        ]
      },
      "Period": 60,
      "Stat": "Average"
    },
    "ReturnData": false
  },
  {
    "Id": "desired",
    "MetricStat": {
      "Metric": {
        "Namespace": "ECS/ContainerInsights",
        "MetricName": "DesiredTaskCount",
        "Dimensions": [
          { "Name": "ClusterName", "Value": "prod-cluster" },
          { "Name": "ServiceName", "Value": "api" }
        ]
      },
      "Period": 60,
      "Stat": "Average"
    },
    "ReturnData": false
  },
  {
    "Id": "task_gap",
    "Expression": "desired-running",
    "Label": "desired minus running",
    "ReturnData": true
  }
]

aws cloudwatch get-metric-data \
  --start-time 2026-06-02T00:00:00Z \
  --end-time 2026-06-02T01:00:00Z \
  --metric-data-queries file://ecs-running-vs-desired.json

CloudWatch Alarm conditions:
metric math id: task_gap
comparison: GreaterThanThreshold
threshold: 0
evaluation_periods: 5
datapoints_to_alarm: 5
period: 60
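severity: P1

3. PromQL#
First confirm the YACE metric names with the query below:
# Use a regex to list the ECS metric names that YACE currently exposes.
# This is not an alert expression; it only confirms metric and label names before rollout.
{__name__=~"aws_ecs_.*(running|desired|cpu|memory).*"}

Specific PromQL: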
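# Running tasks below desired tasks means ECS has not started every task it wants.
# Typical causes: a stuck deployment, image pull failure, failing health checks, or insufficient capacity.
aws_ecs_running_task_count_average{cluster_name="prod-cluster", service_name="api"}
  < aws_ecs_desired_task_count_average{cluster_name="prod-cluster", service_name="api"}

# ECS service average CPU utilization is at or above 80%.
# Suited to a P2 capacity-risk alert; not recommended on its own as a P1 availability alert.
aws_ecs_cpu_utilization_average{cluster_name="prod-cluster", service_name="api"} >= 80

# ECS service average memory utilization is at or above 85%.
# When Fargate/container memory fills up it usually ends in an OOM kill, so this threshold is treated as more sensitive than the CPU one.
aws_ecs_memory_utilization_average{cluster_name="prod-cluster", service_name="api"} >= 85

Node.js application metrics: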
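# Compute the HTTP 5xx error rate over 5 minutes:
# numerator = rate of 5xx requests
# denominator = rate of all requests
# multiply by 100 = percentage
# A result >= 5 means the 5xx error rate has reached 5%.
100 *
  sum(rate(http_requests_total{service="api",status=~"5.."}[5m]))
  /
  sum(rate(http_requests_total{service="api"}[5m]))
  >= 5

# Compute the p95 of HTTP request duration: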
# http_request_duration_seconds_bucket holds the Prometheus histogram buckets
# rate(...[5m]) takes each bucket's growth rate over the last 5 minutes
# sum by (le) keeps the bucket boundary le and merges buckets across instances
# histogram_quantile(0.95, ...) estimates p95 from the buckets
# The result is in seconds; >= 1 means p95 request duration is at or above 1 second.
histogram_quantile(
  0.95,
  sum by (le) (rate(http_request_duration_seconds_bucket{service="api"}[5m]))
) >= 1

# Compute the p95 of Node.js event loop lag:
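# High event loop lag means the JS main thread is blocked by CPU work, synchronous code, or GC.
# The result is in seconds; >= 0.2 means p95 lag is at or above 200ms.
# Note: this assumes the app exports event loop lag as a histogram. Confirm the series name first,
# since prom-client's default metrics expose lag as gauges (for example nodejs_eventloop_lag_p99_seconds).
histogram_quantile(
  0.95,
  sum by (le) (rate(nodejs_eventloop_lag_seconds_bucket{service="api"}[5m]))
) >= 0.2

# Compute V8 heap usage: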
# used bytes / total bytes = heap usage ratio
# >= 0.85 means 85% of the heap is in use; investigate a memory leak or raise the memory limit.
nodejs_heap_size_used_bytes{service="api"}
  /
nodejs_heap_size_total_bytes{service="api"}
  >= 0.85

4. vmalert Rules#
groups:
  - name: ecs-nodejs.rules
    rules:
      - alert: ECSServiceRunningTaskTooLow
        # Fires when desired > running, meaning the ECS service has not reached its desired task count.
        expr: aws_ecs_running_task_count_average < aws_ecs_desired_task_count_average
        for: 5m
        labels:
          severity: P1
          component: ecs
        annotations:
          summary: "ECS running task count is below desired count"
      - alert: NodeJSHigh5xxRate
        # 5xx rate = 5xx request rate / total request rate * 100.
        # by (service) computes the ratio per service so that services are not mixed together.
        expr: |
          100 * sum(rate(http_requests_total{status=~"5.."}[5m])) by (service)
          /
          sum(rate(http_requests_total[5m])) by (service) >= 5
        for: 5m
        labels:
          severity: P1
          component: app
        annotations:
          summary: "Node.js HTTP 5xx rate is >= 5%"
      - alert: NodeJSHighP95Latency
        # Computes each service's p95 latency from the histogram buckets, in seconds.
        expr: |
          histogram_quantile(0.95, sum by (service, le) (rate(http_request_duration_seconds_bucket[5m]))) >= 1
        for: 10m
        labels:
          severity: P1
          component: app
        annotations:
          summary: "Node.js HTTP p95 latency is >= 1s"
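The p95 rule relies on `histogram_quantile`, which estimates a quantile by interpolating linearly inside the bucket that contains the target rank. The Python below is a simplified sketch of that idea, not Prometheus's actual implementation: it ignores native histograms and several edge cases, and the bucket counts are made up for illustration.

```python
import math
from typing import List, Tuple


def histogram_quantile(q: float, buckets: List[Tuple[float, float]]) -> float:
    """Estimate quantile q from cumulative (le, count) buckets.

    Simplified sketch: linear interpolation inside the bucket that holds the
    target rank, with the lowest bucket assumed to start at 0.
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    buckets = sorted(buckets)
    total = buckets[-1][1]  # the +Inf bucket holds the total observation count
    if total == 0:
        return math.nan
    rank = q * total
    prev_le, prev_count = 0.0, 0.0
    for le, count in buckets:
        if count >= rank:
            if math.isinf(le):
                # Rank falls in the +Inf bucket: report the highest finite bound.
                return prev_le
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_le + (le - prev_le) * fraction
        prev_le, prev_count = le, count
    return math.nan


# Hypothetical cumulative bucket counts for http_request_duration_seconds.
example = [(0.1, 50), (0.5, 90), (1.0, 98), (2.5, 100), (math.inf, 100)]
p95 = histogram_quantile(0.95, example)
print(round(p95, 2))  # about 0.81, inside the (0.5, 1.0] bucket
print(p95 >= 1)       # the p95 latency rule would not fire for this data
```

Because the estimate is interpolated, its accuracy depends on how finely the buckets are spaced around the 1s threshold.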