AWS ECS And Node.js Monitoring

Links

https://docs.aws.amazon.com/AmazonECS/latest/developerguide/available-metrics.html
https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/using-metric-math.html
https://prometheus.io/docs/prometheus/latest/configuration/alerting_rules/

1. Alert Standard

Severity	Alert	Meaning	Why Monitor	Definition	Duration
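P0	Service unavailable	ECS task count is below desired and the blackbox health check is failing	Confirms the failure at both the scheduling layer and from the user's perspective; act immediately	`running_task_count < desired_task_count` and blackbox `/health` failed	3m
P1	Running task too low	Running task count is below the desired task count	Detects failed deployments, crashing tasks, insufficient capacity, or image startup failures	`running_task_count < desired_task_count`	5m
P1	HTTP 5xx high	The share of HTTP 5xx responses from the application is too high	Directly reflects failed user requests and release or dependency problems	`5xx_rate >= 5%`	5m
P1	P95 latency high	Request tail latency is elevated	Detects degraded user experience, slow dependencies, or service saturation	`p95 >= 1s` or `p95 > 2 * avg_over_time(p95[1h])`	10m
P2	CPU high	ECS service CPU utilization stays high	Gives early warning of capacity shortfalls and autoscaling risk	`CPUUtilization >= 80%`	15m
P2	Memory high	ECS service memory utilization stays high	Guards against OOM kills, restart loops, and performance jitter	`MemoryUtilization >= 85%`	15m
P2	Event loop lag high	Node.js event loop is blocked for long periods	Detects synchronous blocking, CPU-intensive code, or GC pressure	`nodejs_eventloop_lag_seconds_p95 >= 0.2s`	10m
P2	Heap pressure	Node.js heap usage is approaching its limit	Gives early warning of memory leaks and of latency caused by frequent GC	`heap_used / heap_total >= 85%`	10m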

Unified definition of "significantly elevated":

current_5m > absolute_threshold
and
current_5m > 2 * avg_over_time(metric[1h])
for 10m
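
The definition above can be sketched as a pair of vmalert rules. This is a minimal sketch, not a definitive implementation: it applies the definition to p95 latency, assumes the application exposes the `http_request_duration_seconds_bucket` histogram used in the PromQL section, and assumes an absolute threshold of 1 second. The group name, the recording rule name `service:http_request_duration_seconds:p95_5m`, and the alert name are hypothetical.

```yaml
groups:
  - name: significantly-elevated.example   # hypothetical group name
    rules:
      # Record the 5m p95 as its own series so a 1h baseline can be
      # computed from it with avg_over_time.
      - record: service:http_request_duration_seconds:p95_5m   # hypothetical name
        expr: |
          histogram_quantile(
            0.95,
            sum by (service, le) (rate(http_request_duration_seconds_bucket[5m]))
          )

      - alert: P95LatencySignificantlyElevated   # hypothetical alert name
        # Both conditions must hold: above the absolute threshold (assumed 1s)
        # and above twice the 1h average of the same series.
        expr: |
          service:http_request_duration_seconds:p95_5m > 1
          and
          service:http_request_duration_seconds:p95_5m
            > 2 * avg_over_time(service:http_request_duration_seconds:p95_5m[1h])
        for: 10m
        labels:
          severity: P1
          component: app
        annotations:
          summary: "p95 latency is above 1s and above twice its 1h average"
```

Requiring both conditions keeps the alert quiet for services that are always slow but stable, and for fast services whose latency doubles while staying far below the absolute threshold. The recorded series needs about an hour of history before the baseline is meaningful.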

2. CloudWatch Metrics

Metric	Meaning	Why Monitor	Namespace	Statistic	Period
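CPUUtilization	CPU utilization of the ECS service/task	Shows whether compute is saturated and whether scale-out should trigger	AWS/ECS	Average	60s
MemoryUtilization	Memory utilization of the ECS service/task	Indicates risk of OOM, memory leaks, and task restarts	AWS/ECS	Average	60s
RunningTaskCount	Number of tasks currently in the RUNNING state	Shows whether actual service capacity matches the desired level	ECS/ContainerInsights (requires Container Insights to be enabled)	Average	60s
DesiredTaskCount	Number of tasks the ECS service wants running	Compared with RunningTaskCount to detect scheduling or startup failures	ECS/ContainerInsights (requires Container Insights to be enabled)	Average	60s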

Query whether the running task count is below the desired task count. Save the metric data queries as `ecs-running-vs-desired.json`:

[
  {
    "Id": "running",
    "MetricStat": {
      "Metric": {
        "Namespace": "ECS/ContainerInsights",
        "MetricName": "RunningTaskCount",
        "Dimensions": [
          { "Name": "ClusterName", "Value": "prod-cluster" },
          { "Name": "ServiceName", "Value": "api" }
        ]
      },
      "Period": 60,
      "Stat": "Average"
    },
    "ReturnData": false
  },
  {
    "Id": "desired",
    "MetricStat": {
      "Metric": {
        "Namespace": "ECS/ContainerInsights",
        "MetricName": "DesiredTaskCount",
        "Dimensions": [
          { "Name": "ClusterName", "Value": "prod-cluster" },
          { "Name": "ServiceName", "Value": "api" }
        ]
      },
      "Period": 60,
      "Stat": "Average"
    },
    "ReturnData": false
  },
  {
    "Id": "task_gap",
    "Expression": "desired-running",
    "Label": "desired minus running",
    "ReturnData": true
  }
]

aws cloudwatch get-metric-data \
  --start-time 2026-06-02T00:00:00Z \
  --end-time 2026-06-02T01:00:00Z \
  --metric-data-queries file://ecs-running-vs-desired.json

CloudWatch Alarm condition:

metric math id: task_gap
comparison: GreaterThanThreshold
threshold: 0
evaluation_periods: 5
datapoints_to_alarm: 5
period: 60
severity: P1
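
The alarm condition above could be expressed as a CloudFormation resource. This is a minimal sketch that reuses the cluster and service names from the query JSON. The logical ID `EcsTaskGapAlarm` and the alarm name are hypothetical, and no alarm actions or missing-data policy are set, since the notes above do not specify them.

```yaml
AWSTemplateFormatVersion: "2010-09-09"
Resources:
  EcsTaskGapAlarm:   # hypothetical logical ID
    Type: AWS::CloudWatch::Alarm
    Properties:
      AlarmName: prod-cluster-api-running-below-desired   # hypothetical name
      AlarmDescription: "P1: RunningTaskCount is below DesiredTaskCount"
      ComparisonOperator: GreaterThanThreshold
      Threshold: 0
      EvaluationPeriods: 5
      DatapointsToAlarm: 5
      Metrics:
        - Id: running
          ReturnData: false
          MetricStat:
            Period: 60
            Stat: Average
            Metric:
              Namespace: ECS/ContainerInsights
              MetricName: RunningTaskCount
              Dimensions:
                - Name: ClusterName
                  Value: prod-cluster
                - Name: ServiceName
                  Value: api
        - Id: desired
          ReturnData: false
          MetricStat:
            Period: 60
            Stat: Average
            Metric:
              Namespace: ECS/ContainerInsights
              MetricName: DesiredTaskCount
              Dimensions:
                - Name: ClusterName
                  Value: prod-cluster
                - Name: ServiceName
                  Value: api
        # Only the math expression returns data; the alarm evaluates it.
        - Id: task_gap
          Expression: desired - running
          Label: desired minus running
          ReturnData: true
```

With five 60-second evaluation periods and five datapoints to alarm, the gap has to persist for five consecutive minutes, which matches the 5m duration of the P1 row in the alert table.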

3. PromQL

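First confirm the YACE metric names with the query below:

# Use a regex to list the ECS metric names that YACE currently exposes.
# This is not an alert expression; it only confirms metric names and label names before rollout.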
{__name__=~"aws_ecs_.*(running|desired|cpu|memory).*"}

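Concrete PromQL:

# Running tasks below desired tasks means that not all tasks ECS wants are running.
# Typical causes: a stuck deployment, image pull failure, failing health checks, or insufficient capacity.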
aws_ecs_running_task_count_average{cluster_name="prod-cluster", service_name="api"}
  < aws_ecs_desired_task_count_average{cluster_name="prod-cluster", service_name="api"}

# ECS service average CPU utilization is at or above 80%.
# Suitable as a P2 capacity-risk alert; not recommended on its own as a P1 availability alert.
aws_ecs_cpu_utilization_average{cluster_name="prod-cluster", service_name="api"} >= 80

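# ECS service average memory utilization is at or above 85%.
# On Fargate or in a container, exhausting memory usually leads to an OOM kill, so memory saturation is less forgiving than CPU saturation.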
aws_ecs_memory_utilization_average{cluster_name="prod-cluster", service_name="api"} >= 85

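Node.js application metrics:

# Compute the HTTP 5xx error rate over 5 minutes:
#   numerator   = rate of 5xx requests
#   denominator = rate of all requests
#   multiply by 100 to get a percentage
# A result >= 5 means the 5xx error rate has reached 5%.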
100 *
sum(rate(http_requests_total{service="api",status=~"5.."}[5m]))
/
sum(rate(http_requests_total{service="api"}[5m]))
>= 5

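# Compute the p95 of HTTP request duration:
#   http_request_duration_seconds_bucket is a Prometheus histogram bucket series
#   rate(...[5m]) takes each bucket's per-second rate of increase over the last 5 minutes
#   sum by (le) keeps the bucket boundary le and merges buckets across instances
#   histogram_quantile(0.95, ...) estimates p95 from the buckets
# The result is in seconds; >= 1 means p95 request duration is at least 1 second.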
histogram_quantile(
  0.95,
  sum by (le) (rate(http_request_duration_seconds_bucket{service="api"}[5m]))
) >= 1

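# Compute the p95 of Node.js event loop lag:
#   high event loop lag means the JS main thread is blocked by CPU work, synchronous code, or GC
#   the result is in seconds; >= 0.2 means p95 lag is at least 200ms.
# Note: this assumes the app exports a histogram named nodejs_eventloop_lag_seconds.
# prom-client's default metrics expose gauges instead, such as nodejs_eventloop_lag_p99_seconds,
# so confirm which series exist before using this expression.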
histogram_quantile(
  0.95,
  sum by (le) (rate(nodejs_eventloop_lag_seconds_bucket{service="api"}[5m]))
) >= 0.2

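# Compute V8 heap usage:
#   used bytes / total bytes = fraction of the heap in use
#   >= 0.85 means 85% of the heap is in use; investigate a memory leak or raise the memory limit.
# Note: heap total is the heap V8 has currently allocated, which grows on demand; it is not the
# configured maximum, so this ratio can run high on a healthy process. Tune the threshold accordingly.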
nodejs_heap_size_used_bytes{service="api"}
/
nodejs_heap_size_total_bytes{service="api"}
>= 0.85

4. vmalert Rules

groups:
  - name: ecs-nodejs.rules
    rules:
      - alert: ECSServiceRunningTaskTooLow
        # Fires when desired > running, meaning the ECS service has not reached its desired task count.
        expr: aws_ecs_running_task_count_average < aws_ecs_desired_task_count_average
        for: 5m
        labels:
          severity: P1
          component: ecs
        annotations:
          summary: "ECS running task count is below desired count"

      - alert: NodeJSHigh5xxRate
        # 5xx rate = 5xx request rate / total request rate * 100.
        # by (service) computes the ratio per service instead of mixing services together.
        expr: |
          100 * sum(rate(http_requests_total{status=~"5.."}[5m])) by (service)
          /
          sum(rate(http_requests_total[5m])) by (service) >= 5
        for: 5m
        labels:
          severity: P1
          component: app
        annotations:
          summary: "Node.js HTTP 5xx rate is >= 5%"

      - alert: NodeJSHighP95Latency
        # Compute each service's p95 latency from histogram buckets, in seconds.
        expr: |
          histogram_quantile(0.95, sum by (service, le) (rate(http_request_duration_seconds_bucket[5m]))) >= 1
        for: 10m
        labels:
          severity: P1
          component: app
        annotations:
          summary: "Node.js HTTP p95 latency is >= 1s"
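
The rules above cover only the P1 rows of the alert table. The P0 and P2 rows might be sketched as follows, reusing the thresholds and durations from the table and the metric names from the PromQL section. Several things here are assumptions rather than confirmed names: the blackbox exporter series is assumed to be `probe_success` with labels `job="blackbox"` and `service="api"`, the event loop lag histogram is assumed to be exported by the application, and the group and alert names are hypothetical. The YACE metric and label names should still be confirmed with the discovery query first.

```yaml
groups:
  - name: ecs-nodejs-extra.rules   # hypothetical group name
    rules:
      - alert: ServiceUnavailable
        # P0: task count is below desired AND the blackbox /health probe fails.
        # "and on()" ignores labels, because the two sides come from different
        # exporters and share no label set.
        expr: |
          (
            aws_ecs_running_task_count_average{cluster_name="prod-cluster", service_name="api"}
              < aws_ecs_desired_task_count_average{cluster_name="prod-cluster", service_name="api"}
          )
          and on()
          (probe_success{job="blackbox", service="api"} == 0)
        for: 3m
        labels:
          severity: P0
          component: ecs
        annotations:
          summary: "ECS tasks are below desired and the /health probe is failing"

      - alert: ECSServiceCPUHigh
        expr: aws_ecs_cpu_utilization_average >= 80
        for: 15m
        labels:
          severity: P2
          component: ecs
        annotations:
          summary: "ECS service CPU utilization is >= 80%"

      - alert: ECSServiceMemoryHigh
        expr: aws_ecs_memory_utilization_average >= 85
        for: 15m
        labels:
          severity: P2
          component: ecs
        annotations:
          summary: "ECS service memory utilization is >= 85%"

      - alert: NodeJSEventLoopLagHigh
        # Assumes a histogram named nodejs_eventloop_lag_seconds exists.
        expr: |
          histogram_quantile(0.95, sum by (service, le) (rate(nodejs_eventloop_lag_seconds_bucket[5m]))) >= 0.2
        for: 10m
        labels:
          severity: P2
          component: app
        annotations:
          summary: "Node.js event loop lag p95 is >= 200ms"

      - alert: NodeJSHeapPressure
        # Evaluated per series, so a single instance under pressure fires.
        expr: nodejs_heap_size_used_bytes / nodejs_heap_size_total_bytes >= 0.85
        for: 10m
        labels:
          severity: P2
          component: app
        annotations:
          summary: "Node.js heap used / heap total is >= 85%"
```

The P0 rule is hard-coded to one cluster and service because `and on()` drops all labels when matching; with several services in one rule, any failing probe would pair with any task gap.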