Links#
https://docs.aws.amazon.com/AmazonElastiCache/latest/dg/CacheMetrics.Redis.html
https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/using-metric-math.html
https://prometheus.io/docs/prometheus/latest/configuration/alerting_rules/

1. Alert Standard#
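| Severity | Alert | Meaning | Why Monitor | Definition | Duration |
|---|---|---|---|---|---|
| P1 | Engine CPU high | Valkey/Redis engine thread CPU stays high | A hot single thread directly raises command latency and lowers throughput | EngineCPUUtilization >= 80% | 10m |
| P1 | Memory critical | Database memory usage is close to its limit | Guards against OOM, write failures, and eviction risk | DatabaseMemoryUsagePercentage >= 90% | 10m |
| P1 | FreeableMemory low | The node is short on freeable memory | Identifies node memory pressure and OS-level stability risk | FreeableMemory < 512 MiB or < 10% of node memory | 10m |
| P1 | Evictions | The cache has started evicting keys | Indicates insufficient memory or a triggered maxmemory policy; may affect hit rate and business correctness | Evictions > 0 | 5m |
| P1 | Replication lag | The replica is many seconds behind the primary | Affects read-replica freshness and data-loss risk on failover | ReplicationLag >= 10s | 5m |
| P2 | Cache hit low | Cache hit rate is low | A low hit rate increases database load and request latency | CacheHitRate < 80% | 15m |
| P2 | Swap usage | The node is using swap | Swap significantly increases latency and usually signals abnormal memory pressure | SwapUsage > 0 | 10m |
| P2 | Connections near limit | Current connections are close to the connection limit | Guards against new-connection failures and client connection pool exhaustion | CurrConnections >= 80% of connection_limit | 10m |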
Cache hit rate:
CacheHitRate = 100 * CacheHits / (CacheHits + CacheMisses)
low = < 80% for 15m

2. CloudWatch Metrics#
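| Metric | Meaning | Why Monitor | Namespace | Statistic | Period |
|---|---|---|---|---|---|
| EngineCPUUtilization | CPU utilization of the Valkey/Redis engine thread | Shows whether the command execution thread has become the bottleneck | AWS/ElastiCache | Average | 60s |
| DatabaseMemoryUsagePercentage | Percentage of database memory in use | Shows whether key/value data is approaching the memory limit | AWS/ElastiCache | Average | 60s |
| FreeableMemory | Freeable memory on the node | Shows system memory pressure and node stability | AWS/ElastiCache | Average | 60s |
| Evictions | Number of keys evicted during the period | Detects cache data being forcibly removed because memory is short | AWS/ElastiCache | Sum | 300s |
| ReplicationLag | Replication delay of the replica relative to the primary | Shows read-replica consistency and failover risk | AWS/ElastiCache | Maximum | 60s |
| CacheHits | Number of cache hits | Used to compute the hit rate and judge the value of the cache | AWS/ElastiCache | Sum | 300s |
| CacheMisses | Number of cache misses | Used to compute the hit rate and detect cache penetration or expiry problems | AWS/ElastiCache | Sum | 300s |
| CurrConnections | Current number of client connections | Shows connection pool pressure and connection-limit risk | AWS/ElastiCache | Average | 60s |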
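
The Statistic and Period columns above translate fairly directly into an exporter scrape configuration. The following is a minimal sketch, assuming YACE (the exporter named in the PromQL section) and its `v1alpha1` discovery format. The region is a placeholder, the job-level `period`/`length` defaults with per-metric overrides are one possible layout, and field names may differ between YACE versions, so check it against the version you run.

```yaml
# Sketch of a YACE discovery job for the metrics in the table above.
# Assumptions: YACE v1alpha1 config format; region is a placeholder.
apiVersion: v1alpha1
discovery:
  jobs:
    - type: AWS/ElastiCache
      regions:
        - us-east-1            # placeholder: replace with your region
      period: 60               # default period (seconds) for metrics in this job
      length: 300              # how far back each request looks (seconds)
      metrics:
        - name: EngineCPUUtilization
          statistics: [Average]
        - name: DatabaseMemoryUsagePercentage
          statistics: [Average]
        - name: FreeableMemory
          statistics: [Average]
        - name: ReplicationLag
          statistics: [Maximum]
        - name: CurrConnections
          statistics: [Average]
        # Count-style metrics use Sum over a 300s period, as in the table.
        - name: Evictions
          statistics: [Sum]
          period: 300
          length: 300
        - name: CacheHits
          statistics: [Sum]
          period: 300
          length: 300
        - name: CacheMisses
          statistics: [Sum]
          period: 300
          length: 300
```

The statistic chosen here becomes the suffix of the exported series name (for example `_average`, `_sum`, `_maximum`), which is why the queries later in this page use those suffixes.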
Cache hit rate metric math:
[
  {
    "Id": "hits",
    "MetricStat": {
      "Metric": {
        "Namespace": "AWS/ElastiCache",
        "MetricName": "CacheHits",
        "Dimensions": [{ "Name": "CacheClusterId", "Value": "prod-valkey-001" }]
      },
      "Period": 300,
      "Stat": "Sum"
    },
    "ReturnData": false
  },
  {
    "Id": "misses",
    "MetricStat": {
      "Metric": {
        "Namespace": "AWS/ElastiCache",
        "MetricName": "CacheMisses",
        "Dimensions": [{ "Name": "CacheClusterId", "Value": "prod-valkey-001" }]
      },
      "Period": 300,
      "Stat": "Sum"
    },
    "ReturnData": false
  },
  {
    "Id": "hit_rate",
    "Expression": "IF((hits+misses)>0,100*hits/(hits+misses),100)",
    "Label": "Cache hit rate",
    "ReturnData": true
  }
]

3. PromQL#
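# Use a regex first to confirm the ElastiCache metric names that YACE exposes.
# Depending on the exporter configuration, labels may be cache_cluster_id, replication_group_id, and so on.
{__name__=~"aws_elasticache_.*(cpu|memory|evictions|lag|hits|misses|connections).*"}

# Valkey engine thread CPU is at or above 80%.
# When the Redis/Valkey main thread is under pressure, latency can rise even if host (EC2) CPU is not high.
aws_elasticache_engine_cpu_utilization_average{cache_cluster_id="prod-valkey-001"} >= 80

# Valkey data memory usage is at or above 90%.
# Close to maxmemory, evictions or write failures become likely.
aws_elasticache_database_memory_usage_percentage_average{cache_cluster_id="prod-valkey-001"} >= 90

# Evictions occurred in the last 5 minutes.
# An eviction means a key was removed by the memory policy; for a production cache this is usually at least P1.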
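# YACE exposes CloudWatch statistics as gauges, not counters, so sum the samples in the window instead of using increase().
sum_over_time(aws_elasticache_evictions_sum{cache_cluster_id="prod-valkey-001"}[5m]) > 0

# Cache hit rate over 15 minutes:
# numerator   = cache hits summed over the last 15 minutes
# denominator = hits + misses summed over the same window
# With no requests the ratio is 0/0 = NaN, and NaN < 80 is false, so an idle cache does not alert.
# < 80 means the hit rate is below 80%.
100 *
  sum_over_time(aws_elasticache_cache_hits_sum{cache_cluster_id="prod-valkey-001"}[15m])
/
(
  sum_over_time(aws_elasticache_cache_hits_sum{cache_cluster_id="prod-valkey-001"}[15m])
  +
  sum_over_time(aws_elasticache_cache_misses_sum{cache_cluster_id="prod-valkey-001"}[15m])
)
< 80

4. vmalert Rules#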
groups:
  - name: elasticache-valkey.rules
    rules:
      - alert: ValkeyMemoryCritical
        # Valkey data memory usage is at or above 90%, close to maxmemory.
        expr: aws_elasticache_database_memory_usage_percentage_average >= 90
        for: 10m
        labels:
          severity: P1
          component: valkey
        annotations:
          summary: "Valkey memory usage is >= 90%"
      - alert: ValkeyEvictions
        # Evictions summed over the last 5 minutes are greater than 0.
        # The exporter exposes this CloudWatch Sum as a gauge, so use sum_over_time rather than increase.
        expr: sum_over_time(aws_elasticache_evictions_sum[5m]) > 0
        for: 5m
        labels:
          severity: P1
          component: valkey
        annotations:
          summary: "Valkey evictions occurred"
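
The rules above cover two of the P1 alerts from the Alert Standard table. The sketch below shows how two more could be written in the same style, with thresholds and durations taken from that table. The group name and the `aws_elasticache_replication_lag_maximum` series name are assumptions that follow the naming pattern used on this page; confirm the names your exporter actually exposes with the regex query in the PromQL section before relying on them.

```yaml
groups:
  - name: elasticache-valkey-extra.rules   # hypothetical group name
    rules:
      - alert: ValkeyEngineCpuHigh
        # Engine thread CPU at or above 80% for 10 minutes (P1 in the table).
        expr: aws_elasticache_engine_cpu_utilization_average >= 80
        for: 10m
        labels:
          severity: P1
          component: valkey
        annotations:
          summary: "Valkey engine CPU is >= 80%"
      - alert: ValkeyReplicationLagHigh
        # Replica is 10 seconds or more behind the primary for 5 minutes.
        # Assumed series name: ReplicationLag collected with the Maximum statistic.
        expr: aws_elasticache_replication_lag_maximum >= 10
        for: 5m
        labels:
          severity: P1
          component: valkey
        annotations:
          summary: "Valkey replication lag is >= 10s"
```

Both expressions compare a gauge directly with a threshold, so the `for` clause alone decides how long the condition must hold before the alert fires.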