vmalert

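vmalert is the rule evaluator in the VictoriaMetrics ecosystem. It executes two kinds of rules:

recording rules:
    periodically evaluate PromQL / MetricsQL expressions and write the results back to VictoriaMetrics

alerting rules:
    periodically evaluate alert expressions and send firing alerts to Alertmanager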

1. Architecture

vmagent / Prometheus
    -> remote_write
    -> VictoriaMetrics
    <- query
    <- vmalert
    -> Alertmanager
    -> Slack / Email / Webhook / PagerDuty / etc.
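
As a rough sketch of how these components could be wired together, the Docker Compose file below runs VictoriaMetrics, Alertmanager, and vmalert on one network. The service names and the ./rules mount are assumptions made for illustration; the ingestion side (vmagent / Prometheus) and the Alertmanager receivers are left out.

```yaml
# Minimal sketch, not a production setup. Pin image versions instead of
# "latest" for anything long-lived.
services:
  victoriametrics:
    image: victoriametrics/victoria-metrics:latest
    ports:
      - "8428:8428"

  alertmanager:
    image: prom/alertmanager:latest
    ports:
      - "9093:9093"

  vmalert:
    image: victoriametrics/vmalert:latest
    depends_on:
      - victoriametrics
      - alertmanager
    ports:
      - "8880:8880"
    volumes:
      - ./rules:/rules:ro
    command:
      # vmalert queries VictoriaMetrics to evaluate rules
      - "-datasource.url=http://victoriametrics:8428"
      # base URL only: vmalert appends /api/v1/write by default
      - "-remoteWrite.url=http://victoriametrics:8428"
      # firing alerts are sent to Alertmanager
      - "-notifier.url=http://alertmanager:9093"
      - "-rule=/rules/*.yml"
```

Inside the Compose network, containers reach each other by service name, so host.docker.internal is not needed here.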

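2. Install

Docker

# On a plain Linux Docker Engine, host.docker.internal may not resolve; if so, add
#   --add-host=host.docker.internal:host-gateway
# vmalert appends /api/v1/write to -remoteWrite.url by default, so pass only the base URL.
# The -rule glob is quoted so that vmalert, not the shell, expands it.
docker run -d \
  --name vmalert \
  -p 8880:8880 \
  -v "$(pwd)/rules:/rules" \
  victoriametrics/vmalert:latest \
  -datasource.url=http://host.docker.internal:8428 \
  -remoteWrite.url=http://host.docker.internal:8428 \
  -notifier.url=http://host.docker.internal:9093 \
  -rule="/rules/*.yml"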

Binary

# Pass only the base URL to -remoteWrite.url; vmalert appends /api/v1/write by default.
vmalert \
  -datasource.url=http://localhost:8428 \
  -remoteWrite.url=http://localhost:8428 \
  -notifier.url=http://localhost:9093 \
  -rule="./rules/*.yml"

3. vmalert Rules

The vmalert rule file format is compatible with Prometheus rule files, so existing Prometheus alerting and recording rules can usually be loaded as they are.

groups:
  - name: api-alerts
    interval: 30s
    rules:
      - alert: HighHttpErrorRate
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m]))
          /
          sum(rate(http_requests_total[5m]))
          > 0.05
        for: 5m
        labels:
          severity: warning
          team: platform
        annotations:
          summary: "High HTTP error rate"
          description: "5xx error rate is higher than 5% for 5 minutes."

4. Recording Rule Example

groups:
  - name: api-recording-rules
    interval: 30s
    rules:
      - record: service:http_requests:rate5m
        expr: |
          sum by (service, route, method, status) (
            rate(http_requests_total[5m])
          )

Once the rule has run, query the recorded series directly:

service:http_requests:rate5m{service="order-api"}
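
An alerting rule can then read the recorded series instead of recomputing rate() over the raw counters. The sketch below assumes the recording rule above is already being evaluated and that http_requests_total carries a service label; the group name is hypothetical.

```yaml
groups:
  - name: api-alerts-from-recorded-series
    interval: 30s
    rules:
      - alert: HighHttpErrorRate
        # The recording rule keeps the "status" label, so the 5xx share
        # can be derived from the precomputed series.
        expr: |
          sum by (service) (service:http_requests:rate5m{status=~"5.."})
          /
          sum by (service) (service:http_requests:rate5m)
          > 0.05
        for: 5m
        labels:
          severity: warning
```

Because the recorded series is written once per evaluation interval, an alert built on it can lag the raw data by roughly one interval.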

5. Best Practices

Rule Files:
    Keep rules under Git version control
    Split files by domain / service / team
    Alerting rules and recording rules can live in separate directories

Alert Labels:
    severity is required
    team / service / env are recommended
    Do not put dynamic values in labels
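
The label conventions above might look like the following. The env value and the group name are hypothetical. The reason to keep dynamic values out of labels is that Alertmanager identifies an alert by its label set, so a label that changes on every evaluation shows up as a new alert each time.

```yaml
groups:
  - name: label-conventions-example
    rules:
      - alert: HighHttpErrorRate
        expr: |
          sum by (service) (rate(http_requests_total{status=~"5.."}[5m]))
          /
          sum by (service) (rate(http_requests_total[5m]))
          > 0.05
        for: 5m
        labels:
          severity: warning   # required
          team: platform      # recommended: owning team
          env: production     # hypothetical environment name
          # "service" is carried over from the expression's own labels,
          # so it does not need to be set here.
```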

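Annotations:
    summary states the problem in one sentence
    description covers the impact scope and the current value
    runbook_url points to the troubleshooting runbook
    dashboard_url points to the Grafana dashboard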
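
A minimal sketch of these annotations, using the {{ $labels.<name> }} and {{ $value }} template variables to report the affected service and the current value. Both URLs and the group name are placeholders, not real endpoints.

```yaml
groups:
  - name: annotation-conventions-example
    rules:
      - alert: HighHttpErrorRate
        expr: |
          sum by (service) (rate(http_requests_total{status=~"5.."}[5m]))
          /
          sum by (service) (rate(http_requests_total[5m]))
          > 0.05
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High HTTP error rate on {{ $labels.service }}"
          description: "5xx ratio on {{ $labels.service }} is {{ $value }} (threshold 0.05)."
          # Placeholder links; replace with real runbook and dashboard URLs.
          runbook_url: "https://runbooks.example.com/high-http-error-rate"
          dashboard_url: "https://grafana.example.com/d/hypothetical-uid/api-overview"
```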

Query:
    Keep alert expressions readable
    Precompute expensive queries with a recording rule first
    For ratio-based alerts, also account for a denominator that is too small
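
One way to handle the small-denominator case is to require a minimum request rate alongside the ratio, so that a handful of failed requests on an idle service does not trip the alert. The 1 req/s floor and the group name below are arbitrary values chosen for illustration.

```yaml
groups:
  - name: guarded-ratio-example
    rules:
      - alert: HighHttpErrorRate
        # Comparison operators bind tighter than "and", so this reads as
        # (ratio > 0.05) and (traffic > 1), matched on the service label.
        expr: |
          (
            sum by (service) (rate(http_requests_total{status=~"5.."}[5m]))
            /
            sum by (service) (rate(http_requests_total[5m]))
          ) > 0.05
          and
          sum by (service) (rate(http_requests_total[5m])) > 1
        for: 5m
        labels:
          severity: warning
```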

Delivery:
    vmalert is only responsible for firing alerts
    Leave notification routing, silencing, and inhibition to Alertmanager
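
On the Alertmanager side, routing and inhibition could be sketched as below. This is an Alertmanager configuration file, not a vmalert rule file; the receiver names and webhook URLs are hypothetical. Silences are created at runtime through the Alertmanager UI or API rather than in this file.

```yaml
route:
  receiver: default-webhook
  group_by: ['alertname', 'service']
  routes:
    # Critical alerts go to the on-call receiver; everything else
    # falls through to the default receiver.
    - matchers:
        - 'severity="critical"'
      receiver: oncall-webhook

receivers:
  - name: default-webhook
    webhook_configs:
      - url: 'http://alert-hooks.example.internal/default'
  - name: oncall-webhook
    webhook_configs:
      - url: 'http://alert-hooks.example.internal/oncall'

inhibit_rules:
  # Suppress a warning while a critical alert with the same
  # alertname and service is firing.
  - source_matchers:
      - 'severity="critical"'
    target_matchers:
      - 'severity="warning"'
    equal: ['alertname', 'service']
```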

6. Useful Alerts

groups:
  - name: service-sli-alerts
    rules:
      - alert: HighP95Latency
        expr: |
          histogram_quantile(
            0.95,
            sum by (le, service, route) (
              rate(http_request_duration_seconds_bucket[5m])
            )
          ) > 1
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "High p95 latency"
          description: "p95 latency is higher than 1s for 10 minutes."

7. References