AWS ElastiCache Valkey Monitoring


https://docs.aws.amazon.com/AmazonElastiCache/latest/dg/CacheMetrics.Redis.html
https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/using-metric-math.html
https://prometheus.io/docs/prometheus/latest/configuration/alerting_rules/

1. Alert Standard

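| Severity | Alert | Meaning | Why Monitor | Definition | Duration |
|---|---|---|---|---|---|
| P1 | Engine CPU high | Valkey/Redis engine thread CPU stays high | A hot single thread directly increases command latency and reduces throughput | EngineCPUUtilization >= 80% | 10m |
| P1 | Memory critical | Database memory usage is close to the limit | Prevents OOM, write failures, and eviction risk | DatabaseMemoryUsagePercentage >= 90% | 10m |
| P1 | FreeableMemory low | Freeable memory on the node is insufficient | Identifies node memory pressure and system-level stability risk | FreeableMemory < 512 MiB or < 10% of node memory | 10m |
| P1 | Evictions | The cache has started evicting keys | Indicates insufficient memory or that the maxmemory policy has triggered; may affect hit rate and application correctness | Evictions > 0 | 5m |
| P1 | Replication lag | The replica is many seconds behind the primary | Affects read-replica freshness and data risk on failover | ReplicationLag >= 10s | 5m |
| P2 | Cache hit low | Cache hit rate is low | A low hit rate increases database load and request latency | CacheHitRate < 80% | 15m |
| P2 | Swap usage | The node is using swap | Swap significantly increases latency and usually indicates abnormal memory pressure | SwapUsage > 0 | 10m |
| P2 | Connections near limit | Current connections are approaching the connection limit | Prevents new-connection failures and client connection-pool exhaustion | CurrConnections >= 80% of connection_limit | 10m |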
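
Two of the definitions above combine a fixed value with a ratio, which is easy to misread. The following is a minimal Python sketch of how a single sample could be checked against them. The function names, the integer-percent arithmetic, and the example node size and connection limit are assumptions for illustration, not values taken from ElastiCache. The sketch ignores the Duration column, so it only answers whether one data point breaches the threshold.

```python
MIB = 1024 * 1024  # bytes per MiB


def freeable_memory_low(freeable_bytes: int, node_memory_bytes: int,
                        floor_bytes: int = 512 * MIB,
                        floor_percent: int = 10) -> bool:
    """True if FreeableMemory is < 512 MiB or < 10% of node memory."""
    below_fixed_floor = freeable_bytes < floor_bytes
    # Integer arithmetic avoids floating-point rounding at the boundary.
    below_percent_floor = freeable_bytes * 100 < node_memory_bytes * floor_percent
    return below_fixed_floor or below_percent_floor


def connections_near_limit(curr_connections: int, connection_limit: int,
                           percent: int = 80) -> bool:
    """True if CurrConnections is >= 80% of the connection limit."""
    return curr_connections * 100 >= connection_limit * percent


# Hypothetical node with 8 GiB of memory and 600 MiB freeable.
print(freeable_memory_low(600 * MIB, 8 * 1024 * MIB))
# Hypothetical connection limit of 10000 with 8000 open connections.
print(connections_near_limit(8000, 10000))
```

Because the two memory conditions are joined by "or", the effective threshold is whichever of the two values is larger for a given node size.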

Cache hit rate:

CacheHitRate = 100 * CacheHits / (CacheHits + CacheMisses)
low = CacheHitRate < 80% for 15m
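
As a worked example of the formula, here is a minimal Python sketch. The function names are hypothetical, and returning 100 when there is no traffic is a choice made here to mirror the IF(...) guard in the metric math expression in section 2. The 15m duration is not modelled.

```python
def cache_hit_rate(cache_hits: float, cache_misses: float) -> float:
    """Hit rate as a percentage; 100.0 when there were no requests."""
    total = cache_hits + cache_misses
    if total <= 0:
        # No traffic: treat as healthy instead of dividing by zero.
        return 100.0
    return 100.0 * cache_hits / total


def hit_rate_is_low(rate_percent: float, threshold_percent: float = 80.0) -> bool:
    """True when the rate is below the P2 threshold from the table above."""
    return rate_percent < threshold_percent


# 750 hits and 250 misses in the window give a 75% hit rate.
rate = cache_hit_rate(750, 250)
print(rate, hit_rate_is_low(rate))
```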

2. CloudWatch Metrics

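| Metric | Meaning | Why Monitor | Namespace | Statistic | Period |
|---|---|---|---|---|---|
| EngineCPUUtilization | Valkey/Redis engine CPU utilization | Shows whether the command-execution thread has become the bottleneck | AWS/ElastiCache | Average | 60s |
| DatabaseMemoryUsagePercentage | Percentage of database memory in use | Shows whether key/value data is approaching the memory limit | AWS/ElastiCache | Average | 60s |
| FreeableMemory | Freeable memory on the node | Shows system memory pressure and node stability | AWS/ElastiCache | Average | 60s |
| Evictions | Number of keys evicted during the period | Detects cache data being forcibly removed because of insufficient memory | AWS/ElastiCache | Sum | 300s |
| ReplicationLag | Replication delay of the replica relative to the primary | Shows read-replica consistency and failover risk | AWS/ElastiCache | Maximum | 60s |
| CacheHits | Number of cache hits | Used to compute the hit rate and judge the value of the cache | AWS/ElastiCache | Sum | 300s |
| CacheMisses | Number of cache misses | Used to compute the hit rate and detect cache penetration or invalidation problems | AWS/ElastiCache | Sum | 300s |
| CurrConnections | Current number of client connections | Shows connection-pool pressure and the risk of hitting the connection limit | AWS/ElastiCache | Average | 60s |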

Cache hit rate metric math:

[
  {
    "Id": "hits",
    "MetricStat": {
      "Metric": {
        "Namespace": "AWS/ElastiCache",
        "MetricName": "CacheHits",
        "Dimensions": [{ "Name": "CacheClusterId", "Value": "prod-valkey-001" }]
      },
      "Period": 300,
      "Stat": "Sum"
    },
    "ReturnData": false
  },
  {
    "Id": "misses",
    "MetricStat": {
      "Metric": {
        "Namespace": "AWS/ElastiCache",
        "MetricName": "CacheMisses",
        "Dimensions": [{ "Name": "CacheClusterId", "Value": "prod-valkey-001" }]
      },
      "Period": 300,
      "Stat": "Sum"
    },
    "ReturnData": false
  },
  {
    "Id": "hit_rate",
    "Expression": "IF((hits+misses)>0,100*hits/(hits+misses),100)",
    "Label": "Cache hit rate",
    "ReturnData": true
  }
]
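
The same query list can be generated per cluster instead of being copied by hand. The sketch below is one way to do that in Python; the helper name build_hit_rate_queries and its parameters are hypothetical, and the structure simply reproduces the JSON above.

```python
import json
from typing import Any, Dict, List


def build_hit_rate_queries(cache_cluster_id: str,
                           period: int = 300) -> List[Dict[str, Any]]:
    """Build the three metric math queries shown above for one cluster."""

    def metric_query(query_id: str, metric_name: str) -> Dict[str, Any]:
        # Input series: fetched but not returned on its own.
        return {
            "Id": query_id,
            "MetricStat": {
                "Metric": {
                    "Namespace": "AWS/ElastiCache",
                    "MetricName": metric_name,
                    "Dimensions": [
                        {"Name": "CacheClusterId", "Value": cache_cluster_id}
                    ],
                },
                "Period": period,
                "Stat": "Sum",
            },
            "ReturnData": False,
        }

    return [
        metric_query("hits", "CacheHits"),
        metric_query("misses", "CacheMisses"),
        {
            "Id": "hit_rate",
            # Report 100 when there was no traffic in the period.
            "Expression": "IF((hits+misses)>0,100*hits/(hits+misses),100)",
            "Label": "Cache hit rate",
            "ReturnData": True,
        },
    ]


queries = build_hit_rate_queries("prod-valkey-001")
print(json.dumps(queries, indent=2))
```

The resulting list has the shape expected by the MetricDataQueries parameter of the CloudWatch GetMetricData API, which also needs a start time and an end time.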

3. PromQL

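# First use a regex to confirm the ElastiCache metric names exposed by YACE.
# Depending on exporter configuration, labels may be cache_cluster_id, replication_group_id, etc.
{__name__=~"aws_elasticache_.*(cpu|memory|evictions|lag|hits|misses|connections).*"}

# Valkey engine thread CPU at or above 80%.
# When the Redis/Valkey main thread is under pressure, latency can rise even if EC2 CPU is not high.
aws_elasticache_engine_cpu_utilization_average{cache_cluster_id="prod-valkey-001"} >= 80

# Valkey data memory usage at or above 90%.
# Close to maxmemory, evictions or write failures become likely.
aws_elasticache_database_memory_usage_percentage_average{cache_cluster_id="prod-valkey-001"} >= 90

# Evictions occurred in the last 5 minutes.
# An eviction means a key was removed by the memory policy; for a production cache this is usually at least P1.
# Note: increase() assumes counter semantics. If the exporter exposes this CloudWatch Sum as a gauge, sum_over_time() may be the better fit.
increase(aws_elasticache_evictions_sum{cache_cluster_id="prod-valkey-001"}[5m]) > 0

# Compute the 15-minute cache hit rate:
#   numerator   = increase in cache hits over the last 15 minutes
#   denominator = increase in hits + increase in misses
#   clamp_min(..., 1) prevents division by 0 when there are no requests
# < 80 means the hit rate is below 80%.
# Note: with zero traffic the expression evaluates to 0, which also matches < 80.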
100 *
increase(aws_elasticache_cache_hits_sum{cache_cluster_id="prod-valkey-001"}[15m])
/
clamp_min(
  increase(aws_elasticache_cache_hits_sum{cache_cluster_id="prod-valkey-001"}[15m])
  +
  increase(aws_elasticache_cache_misses_sum{cache_cluster_id="prod-valkey-001"}[15m]),
  1
)
< 80

4. vmalert Rules

groups:
  - name: elasticache-valkey.rules
    rules:
      - alert: ValkeyMemoryCritical
        # Valkey data memory usage is at or above 90%, close to maxmemory.
        expr: aws_elasticache_database_memory_usage_percentage_average >= 90
        for: 10m
        labels:
          severity: P1
          component: valkey
        annotations:
          summary: "Valkey memory usage is >= 90%"

      - alert: ValkeyEvictions
        # The eviction increase over the last 5 minutes is greater than 0.
        expr: increase(aws_elasticache_evictions_sum[5m]) > 0
        for: 5m
        labels:
          severity: P1
          component: valkey
        annotations:
          summary: "Valkey evictions occurred"
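
The for: field in these rules corresponds to the Duration column in section 1: the expression has to keep matching for that long before the alert fires. The Python sketch below is a simplified model of that behaviour, assuming evenly spaced rule evaluations. The function name and the (timestamp, matched) input format are hypothetical and do not reflect vmalert internals.

```python
from typing import Iterable, Optional, Tuple


def first_firing_time(evaluations: Iterable[Tuple[int, bool]],
                      for_seconds: int) -> Optional[int]:
    """Return the timestamp at which the alert would start firing.

    evaluations: (timestamp_seconds, expr_matched) pairs in time order.
    The alert is pending from the first match and fires once the
    expression has matched continuously for at least for_seconds.
    """
    active_since: Optional[int] = None
    for timestamp, matched in evaluations:
        if not matched:
            # Any non-matching evaluation resets the pending state.
            active_since = None
            continue
        if active_since is None:
            active_since = timestamp
        if timestamp - active_since >= for_seconds:
            return timestamp
    return None


# Expression matches at every 60s evaluation from t=0 to t=600.
steady = [(t, True) for t in range(0, 601, 60)]
print(first_firing_time(steady, for_seconds=600))

# Same idea, but one evaluation at t=300 does not match.
flapping = [(t, t != 300) for t in range(0, 901, 60)]
print(first_firing_time(flapping, for_seconds=600))
```

In the second series the single non-matching evaluation restarts the pending window, so a 10m duration is not reached within the 15 minutes of data.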