vmalert#
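vmalert is the rule evaluator in the VictoriaMetrics ecosystem. It runs two kinds of rules:

- recording rules: periodically evaluate a PromQL / MetricsQL expression and write the result back to VictoriaMetrics
- alerting rules: periodically evaluate an alert expression and send firing alerts to Alertmanager

1. Architecture#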
vmagent / Prometheus
  -> remote_write
    -> VictoriaMetrics
      <- query
        <- vmalert
          -> Alertmanager
            -> Slack / Email / Webhook / PagerDuty / etc.

2. Install#
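For a local trial, the components in the architecture above could be wired together with Docker Compose instead of separate commands. This is a minimal sketch under assumptions, not an official layout: the service names, the `latest` image tags, and the `./rules` mount are illustrative choices, and vmagent is left out for brevity.

```yaml
# docker-compose.yml (hypothetical file name and service names)
services:
  victoriametrics:
    image: victoriametrics/victoria-metrics:latest
    ports:
      - "8428:8428"

  alertmanager:
    image: prom/alertmanager:latest   # runs with the image's default config
    ports:
      - "9093:9093"

  vmalert:
    image: victoriametrics/vmalert:latest
    depends_on:
      - victoriametrics
      - alertmanager
    ports:
      - "8880:8880"
    volumes:
      - ./rules:/rules:ro             # assumed local rules directory
    command:
      - "-datasource.url=http://victoriametrics:8428"
      # vmalert appends /api/v1/write to this URL by default
      - "-remoteWrite.url=http://victoriametrics:8428"
      - "-notifier.url=http://alertmanager:9093"
      - "-rule=/rules/*.yml"
```

Inside the Compose network the containers reach each other by service name, so `host.docker.internal` is not needed in this variant.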
Docker#
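# vmalert appends /api/v1/write to -remoteWrite.url itself, so pass only the base URL.
# On Linux, host.docker.internal also needs --add-host=host.docker.internal:host-gateway.
docker run -d \
  --name vmalert \
  -p 8880:8880 \
  -v "$(pwd)/rules:/rules" \
  victoriametrics/vmalert:latest \
  -datasource.url=http://host.docker.internal:8428 \
  -remoteWrite.url=http://host.docker.internal:8428 \
  -notifier.url=http://host.docker.internal:9093 \
  -rule="/rules/*.yml"

Binary#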
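# vmalert appends /api/v1/write to -remoteWrite.url itself, so pass only the base URL.
# Quote the -rule glob so that vmalert, not the shell, expands it.
vmalert \
  -datasource.url=http://localhost:8428 \
  -remoteWrite.url=http://localhost:8428 \
  -notifier.url=http://localhost:9093 \
  -rule="./rules/*.yml"

3. vmalert Rules#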
The vmalert rule file format is compatible with Prometheus rule files.
groups:
  - name: api-alerts
    interval: 30s
    rules:
      - alert: HighHttpErrorRate
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m]))
          /
          sum(rate(http_requests_total[5m]))
          > 0.05
        for: 5m
        labels:
          severity: warning
          team: platform
        annotations:
          summary: "High HTTP error rate"
          description: "5xx error rate is higher than 5% for 5 minutes."

4. Recording Rule Example#
groups:
  - name: api-recording-rules
    interval: 30s
    rules:
      - record: service:http_requests:rate5m
        expr: |
          sum by (service, route, method, status) (
            rate(http_requests_total[5m])
          )

Query after recording:

service:http_requests:rate5m{service="order-api"}

5. Best Practices#
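Two of the query guidelines in this section, recording expensive expressions first and guarding ratios against a tiny denominator, can be illustrated on top of the `service:http_requests:rate5m` series recorded above. The following is a sketch only: the group name, the alert name, and the 1 request/second traffic floor are assumptions chosen for illustration, not values from any real deployment.

```yaml
groups:
  - name: api-recorded-alerts          # hypothetical group name
    interval: 30s
    rules:
      - alert: HighServiceErrorRatio   # hypothetical alert name
        expr: |
          (
            sum by (service) (service:http_requests:rate5m{status=~"5.."})
            /
            sum by (service) (service:http_requests:rate5m)
          ) > 0.05
          and
          # traffic floor (assumed): ignore services below 1 req/s
          sum by (service) (service:http_requests:rate5m) > 1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High 5xx ratio on {{ $labels.service }}"
```

The `and` clause keeps only ratio results for services whose total request rate is above the floor, so a service with a handful of requests and one failure does not fire the alert.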
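Rule Files:
- Keep rules under Git version control
- Split files by domain / service / team
- Alerting rules and recording rules can live in separate directories

Alert Labels:
- severity is required
- team / service / env are recommended
- Do not put dynamic values (such as the current measurement) in labels; labels identify the alert, so a changing value creates a new alert each time and resets the for timer

Annotations:
- summary: a one-sentence statement of the problem
- description: the impact and the current value
- runbook_url: link to the troubleshooting document
- dashboard_url: link to the Grafana dashboard

Query:
- Keep alert expressions readable
- Move expensive queries into a recording rule first
- For ratio alerts, also account for a denominator that is too small

Delivery:
- vmalert is only responsible for firing alerts
- Leave notification routing, silencing, and inhibition to Alertmanager

6. Useful Alerts#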
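As one illustration of the label and annotation guidance from the best practices, a target-down alert might look like the sketch below. The `team` and `env` values and both URLs are placeholders (assumptions), and the rule relies on the standard `up` series being present, which depends on how targets are scraped.

```yaml
groups:
  - name: availability-alerts          # hypothetical group name
    rules:
      - alert: TargetDown
        expr: up == 0
        for: 5m
        labels:
          severity: critical
          team: platform               # placeholder value
          env: prod                    # placeholder value; static, not templated
        annotations:
          summary: "Target {{ $labels.instance }} is down"
          description: "Instance {{ $labels.instance }} of job {{ $labels.job }} has been unreachable for 5 minutes (up = {{ $value }})."
          runbook_url: "https://runbooks.example.com/target-down"    # placeholder URL
          dashboard_url: "https://grafana.example.com/d/targets"     # placeholder URL
```

Dynamic details such as the instance name and current value go into annotations through templates, while the labels stay fixed so the alert identity is stable.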
groups:
  - name: service-sli-alerts
    rules:
      - alert: HighP95Latency
        expr: |
          histogram_quantile(
            0.95,
            sum by (le, service, route) (
              rate(http_request_duration_seconds_bucket[5m])
            )
          ) > 1
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "High p95 latency"
          description: "p95 latency is higher than 1s for 10 minutes."

7. References#
- vmalert docs: https://docs.victoriametrics.com/victoriametrics/vmalert/
- Prometheus alerting rules: https://prometheus.io/docs/prometheus/latest/configuration/alerting_rules/