AWS ECS And Node.js Monitoring


https://docs.aws.amazon.com/AmazonECS/latest/developerguide/available-metrics.html
https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/using-metric-math.html
https://prometheus.io/docs/prometheus/latest/configuration/alerting_rules/

1. Alert Standard

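Severity | Alert | Meaning | Why Monitor | Definition | Duration
P0 | Service unavailable | ECS task count is short and the blackbox health check fails | Confirms the failure at both the scheduling layer and from the user's point of view; handle immediately | running_task_count < desired_task_count and blackbox /health failed | 3m
P1 | Running task too low | Running task count is below the desired task count | Detects failed deployments, task crashes, insufficient capacity, or image startup failures | running_task_count < desired_task_count | 5m
P1 | HTTP 5xx high | Application HTTP 5xx ratio is too high | Directly reflects failed user requests and release or dependency problems | 5xx_rate >= 5% | 5m
P1 | P95 latency high | Request tail latency is elevated | Detects degraded user experience, slow dependencies, or service saturation | p95 >= 1s or p95 > 2 * avg_over_time(p95[1h]) | 10m
P2 | CPU high | ECS service CPU utilization stays high | Early warning of insufficient capacity and autoscaling risk | CPUUtilization >= 80% | 15m
P2 | Memory high | ECS service memory utilization stays high | Prevents OOM kills, restart loops, and performance jitter | MemoryUtilization >= 85% | 15m
P2 | Event loop lag high | Node.js event loop is blocked for long periods | Detects synchronous blocking, CPU-intensive code, or GC pressure | nodejs_eventloop_lag_seconds_p95 >= 0.2s | 10m
P2 | Heap pressure | Node.js heap usage is approaching its limit | Early warning of memory leaks and of latency caused by frequent GC | heap_used / heap_total >= 85% | 10m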

Unified definition of "clearly elevated":

current_5m > absolute_threshold
and
current_5m > 2 * avg_over_time(metric[1h])
for 10m
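
The definition above combines an absolute floor, a relative comparison against the 1h average, and a hold time. The following is a minimal Python sketch of that logic, not how any alerting engine implements it. It assumes one evaluation per minute, so "for 10m" becomes "the last 10 evaluations all breach"; the function names and the `(current_5m, baseline_1h)` pair layout are hypothetical.

```python
from typing import Sequence, Tuple


def breaches(current_5m: float, baseline_1h: float,
             absolute_threshold: float, factor: float = 2.0) -> bool:
    """One evaluation: above the absolute threshold AND above factor x the 1h average."""
    return current_5m > absolute_threshold and current_5m > factor * baseline_1h


def clearly_elevated(points: Sequence[Tuple[float, float]],
                     absolute_threshold: float,
                     for_points: int = 10) -> bool:
    """points: (current_5m, baseline_1h) pairs, one per evaluation, oldest first.

    Assumption: evaluations are one minute apart, so for_points=10 models "for 10m".
    """
    if len(points) < for_points:
        return False  # not enough history to satisfy the hold time
    recent = points[-for_points:]
    return all(breaches(cur, base, absolute_threshold) for cur, base in recent)
```

Requiring both conditions keeps a noisy but low metric from alerting (it fails the absolute floor) and keeps a metric that is always high from alerting on its normal level (it fails the 2x baseline check).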

2. CloudWatch Metrics

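Metric | Meaning | Why Monitor | Namespace | Statistic | Period
CPUUtilization | CPU utilization of the ECS service/task | Shows whether compute is saturated and whether scale-out should trigger | AWS/ECS | Average | 60s
MemoryUtilization | Memory utilization of the ECS service/task | Shows the risk of OOM, memory leaks, and task restarts | AWS/ECS | Average | 60s
RunningTaskCount | Number of tasks currently in the RUNNING state | Shows whether actual service capacity has reached the desired level | ECS/ContainerInsights (requires Container Insights to be enabled) | Average | 60s
DesiredTaskCount | Number of tasks the ECS service wants running | Compared with RunningTaskCount to detect scheduling or startup failures | ECS/ContainerInsights (requires Container Insights to be enabled) | Average | 60s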

Query whether the running task count is below the desired task count:

[
  {
    "Id": "running",
    "MetricStat": {
      "Metric": {
        "Namespace": "ECS/ContainerInsights",
        "MetricName": "RunningTaskCount",
        "Dimensions": [
          { "Name": "ClusterName", "Value": "prod-cluster" },
          { "Name": "ServiceName", "Value": "api" }
        ]
      },
      "Period": 60,
      "Stat": "Average"
    },
    "ReturnData": false
  },
  {
    "Id": "desired",
    "MetricStat": {
      "Metric": {
        "Namespace": "ECS/ContainerInsights",
        "MetricName": "DesiredTaskCount",
        "Dimensions": [
          { "Name": "ClusterName", "Value": "prod-cluster" },
          { "Name": "ServiceName", "Value": "api" }
        ]
      },
      "Period": 60,
      "Stat": "Average"
    },
    "ReturnData": false
  },
  {
    "Id": "task_gap",
    "Expression": "desired-running",
    "Label": "desired minus running",
    "ReturnData": true
  }
]

Save the queries above as ecs-running-vs-desired.json, then run:

aws cloudwatch get-metric-data \
  --start-time 2026-06-02T00:00:00Z \
  --end-time 2026-06-02T01:00:00Z \
  --metric-data-queries file://ecs-running-vs-desired.json

CloudWatch Alarm condition:

metric math id: task_gap
comparison: GreaterThanThreshold
threshold: 0
evaluation_periods: 5
datapoints_to_alarm: 5
period: 60
severity: P1
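
The alarm condition above maps fairly directly onto the parameters of the CloudWatch `PutMetricAlarm` API. The sketch below only builds the parameter dictionary and makes no AWS call. The function name and the default alarm name are hypothetical. CloudWatch alarms have no severity field, so carrying P1 in the description is an assumption about convention.

```python
from typing import Any, Dict, List


def build_task_gap_alarm(queries: List[Dict[str, Any]],
                         alarm_name: str = "ecs-api-running-below-desired") -> Dict[str, Any]:
    """Build PutMetricAlarm parameters from the metric math queries shown earlier.

    queries: the same list used for get-metric-data (running, desired, task_gap).
    alarm_name: hypothetical default, pick your own naming scheme.
    """
    # A metric math alarm must watch exactly one returned series (task_gap here).
    # ReturnData defaults to true when omitted, so treat a missing key as True.
    returned = [q["Id"] for q in queries if q.get("ReturnData", True)]
    if len(returned) != 1:
        raise ValueError(f"expected exactly one query with ReturnData=true, got {returned}")

    return {
        "AlarmName": alarm_name,
        "AlarmDescription": "P1: ECS running task count is below desired count",
        "Metrics": queries,                      # the 60s period lives in each MetricStat
        "ComparisonOperator": "GreaterThanThreshold",
        "Threshold": 0,                          # task_gap = desired - running > 0
        "EvaluationPeriods": 5,
        "DatapointsToAlarm": 5,                  # 5 of 5 one-minute datapoints, i.e. 5m
    }
```

With boto3 installed and credentials configured, the result could be passed as `boto3.client("cloudwatch").put_metric_alarm(**params)`, where `params` comes from loading ecs-running-vs-desired.json with `json.load` and calling `build_task_gap_alarm` on it.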

3. PromQL

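Confirm the YACE metric names first with the query below:

# Use a regex to list the ECS metric names that YACE currently exposes.
# This is not an alert expression; it only confirms metric names and label names before rollout.
{__name__=~"aws_ecs_.*(running|desired|cpu|memory).*"}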

Concrete PromQL:

# Running tasks below desired tasks means ECS has not started every task it wants.
# Typical causes: a stuck deployment, image pull failure, failing health checks, or insufficient capacity.
aws_ecs_running_task_count_average{cluster_name="prod-cluster", service_name="api"}
  < aws_ecs_desired_task_count_average{cluster_name="prod-cluster", service_name="api"}

# ECS service average CPU utilization is at or above 80%.
# Suitable as a P2 capacity-risk alert; not recommended on its own as a P1 availability alert.
aws_ecs_cpu_utilization_average{cluster_name="prod-cluster", service_name="api"} >= 80

# ECS service average memory utilization is at or above 85%.
# Exhausting Fargate / container memory usually ends in an OOM kill, so memory warrants closer attention than CPU.
aws_ecs_memory_utilization_average{cluster_name="prod-cluster", service_name="api"} >= 85

Node.js application metrics:

# HTTP 5xx error rate over 5 minutes:
#   numerator   = 5xx request rate
#   denominator = total request rate
#   times 100   = percentage
# A result >= 5 means the 5xx error rate has reached 5%.
100 *
sum(rate(http_requests_total{service="api",status=~"5.."}[5m]))
/
sum(rate(http_requests_total{service="api"}[5m]))
>= 5

# p95 of HTTP request duration:
#   http_request_duration_seconds_bucket is a Prometheus histogram bucket series
#   rate(...[5m]) takes each bucket's growth rate over the last 5 minutes
#   sum by (le) keeps the bucket boundary le and merges buckets across instances
#   histogram_quantile(0.95, ...) estimates p95 from the buckets
# The result is in seconds; >= 1 means p95 request duration is at least 1 second.
histogram_quantile(
  0.95,
  sum by (le) (rate(http_request_duration_seconds_bucket{service="api"}[5m]))
) >= 1

# p95 of Node.js event loop lag:
#   high event loop lag means the JS main thread is blocked by CPU work, synchronous code, or GC
#   The result is in seconds; >= 0.2 means p95 lag is at least 200ms.
histogram_quantile(
  0.95,
  sum by (le) (rate(nodejs_eventloop_lag_seconds_bucket{service="api"}[5m]))
) >= 0.2

# V8 heap usage ratio:
#   used bytes / total bytes = fraction of the heap in use
#   >= 0.85 means 85% of the heap is in use; investigate memory leaks or raise the memory limit.
nodejs_heap_size_used_bytes{service="api"}
/
nodejs_heap_size_total_bytes{service="api"}
>= 0.85
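
The p95 queries above depend on how a quantile is estimated from cumulative buckets. The sketch below is a simplified Python illustration of the linear interpolation idea behind `histogram_quantile`, not Prometheus's actual implementation; it ignores native histograms and several edge cases. The function name and the sample bucket values are made up for illustration.

```python
import math
from typing import List, Tuple


def estimate_quantile(q: float, buckets: List[Tuple[float, float]]) -> float:
    """Estimate the q-quantile from cumulative histogram buckets.

    buckets: (le, cumulative count or rate) pairs; must include le=+Inf.
    Assumes observations are spread evenly inside each bucket.
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    if len(buckets) < 2:
        return math.nan
    buckets = sorted(buckets)
    top_le, total = buckets[-1]
    if not math.isinf(top_le) or total <= 0:
        return math.nan

    rank = q * total                      # position of the target observation
    prev_le, prev_count = 0.0, 0.0        # lowest bucket is assumed to start at 0
    for le, count in buckets:
        if count >= rank:
            if math.isinf(le):
                # Target lies beyond the highest finite bound; that bound is the best estimate.
                return prev_le
            # Linear interpolation inside the bucket that contains the rank.
            return prev_le + (le - prev_le) * (rank - prev_count) / (count - prev_count)
        prev_le, prev_count = le, count
    return math.nan


# Hypothetical bucket rates for one service, after sum by (le).
sample = [(0.1, 20.0), (0.5, 60.0), (1.0, 90.0), (math.inf, 100.0)]
p50 = estimate_quantile(0.5, sample)
p95 = estimate_quantile(0.95, sample)
```

In this sample the median falls inside the 0.1 to 0.5 bucket and is interpolated to about 0.4s, while p95 falls in the +Inf bucket and is reported as the highest finite bound, 1s. The second case is why bucket boundaries should extend past the alert threshold: with a top finite bucket of 1s, the estimate can never exceed 1s.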

4. vmalert Rules

groups:
  - name: ecs-nodejs.rules
    rules:
      - alert: ECSServiceRunningTaskTooLow
        # Fires when desired > running, meaning the ECS service has not reached its desired task count.
        expr: aws_ecs_running_task_count_average < aws_ecs_desired_task_count_average
        for: 5m
        labels:
          severity: P1
          component: ecs
        annotations:
          summary: "ECS running task count is below desired count"

      - alert: NodeJSHigh5xxRate
        # 5xx rate = 5xx request rate / total request rate * 100.
        # by (service) computes the rate per service instead of mixing services together.
        expr: |
          100 * sum(rate(http_requests_total{status=~"5.."}[5m])) by (service)
          /
          sum(rate(http_requests_total[5m])) by (service) >= 5
        for: 5m
        labels:
          severity: P1
          component: app
        annotations:
          summary: "Node.js HTTP 5xx rate is >= 5%"

      - alert: NodeJSHighP95Latency
        # Compute each service's p95 latency from histogram buckets, in seconds.
        expr: |
          histogram_quantile(0.95, sum by (service, le) (rate(http_request_duration_seconds_bucket[5m]))) >= 1
        for: 10m
        labels:
          severity: P1
          component: app
        annotations:
          summary: "Node.js HTTP p95 latency is >= 1s"