vmalert


https://docs.victoriametrics.com/vmalert/
https://docs.victoriametrics.com/vmalert/#alerting-rules
https://docs.victoriametrics.com/vmalert/#notifier
https://docs.victoriametrics.com/vmalert/#recording-rules
https://prometheus.io/docs/alerting/latest/alertmanager/

1. Important Points

vmalert evaluates alerting and recording rules against VictoriaMetrics or another MetricsQL/PromQL-compatible datasource. It commonly sends alerts to Alertmanager.

vmalert is used for:
    alerting rules
    recording rules
    MetricsQL expressions
    alerts from VictoriaMetrics data
    long-range / cloud metrics rules when data is in VictoriaMetrics

vmalert is not responsible for:
    final notification routing
    silence management by itself
    incident workflow

Typical path:

vmagent / Prometheus scrape
  -> VictoriaMetrics
  -> vmalert
  -> Alertmanager
  -> notification channel

2. Runtime Config

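Basic command (-remoteWrite.url and -remoteRead.url point back at the datasource so that recording rule results and alert state are persisted and restored after a restart):

vmalert \
  -datasource.url=http://victoriametrics:8428 \
  -remoteWrite.url=http://victoriametrics:8428 \
  -remoteRead.url=http://victoriametrics:8428 \
  -notifier.url=http://alertmanager:9093 \
  -rule=/etc/vmalert/rules/*.yaml \
  -external.url=https://vmalert.example.com \
  -evaluationInterval=30s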

Docker Compose example:

services:
  vmalert:
    image: victoriametrics/vmalert:<version>
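    command:
      - -datasource.url=http://victoriametrics:8428
      - -notifier.url=http://alertmanager:9093
      # Persist recording rule results and alert state; restore state on restart.
      - -remoteWrite.url=http://victoriametrics:8428
      - -remoteRead.url=http://victoriametrics:8428
      - -rule=/etc/vmalert/rules/*.yaml
      - -external.url=https://vmalert.example.com
      - -evaluationInterval=30s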
    volumes:
      - ./rules:/etc/vmalert/rules:ro
    ports:
      - "8880:8880"

3. Rule Standard

vmalert rule files use the Prometheus-compatible rule format; expr may use MetricsQL when the datasource is VictoriaMetrics.

groups:
  - name: alb.rules
    interval: 30s
    rules:
      - alert: P1ProdAlbTarget5xxRateHigh
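        # YACE exports CloudWatch statistics as gauges (one Sum per period),
        # so aggregate the values directly instead of applying rate(),
        # which expects monotonically increasing counters.
        expr: |
          100 *
          sum(aws_applicationelb_httpcode_target_5xx_count_sum{env="prod", service="api"})
          /
          clamp_min(sum(aws_applicationelb_request_count_sum{env="prod", service="api"}), 1)
          >= 5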
        for: 5m
        labels:
          severity: P1
          env: prod
          service: api
          team: platform
          resource_type: targetgroup
          resource: prod-api
        annotations:
          summary: "ALB target 5xx rate is high"
          dashboard_url: "https://grafana.example.com/d/alb"
          runbook_url: "https://docs.example.com/runbooks/alb-target-5xx"
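The arithmetic behind the expression above can be sketched in Python. This is only an illustration of the ratio and of the clamp_min guard against division by zero; the function names and sample numbers are assumptions, and errors / requests stand for the aggregated numerator and denominator of the rule.

```python
def error_percent(errors: float, requests: float, floor: float = 1.0) -> float:
    """Mirror `100 * errors / clamp_min(requests, floor)`."""
    denominator = max(requests, floor)  # clamp_min: never divide by less than `floor`
    return 100.0 * errors / denominator


def breaches(errors: float, requests: float, threshold: float = 5.0) -> bool:
    """True when the percentage satisfies the rule's `>= 5` comparison."""
    return error_percent(errors, requests) >= threshold
```

For example, error_percent(30, 500) gives 6.0, which meets the threshold, while error_percent(0, 0) gives 0.0 instead of a division by zero.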

4. Recording Rules

Use recording rules to precompute expensive expressions that are evaluated repeatedly. vmalert persists the results only when -remoteWrite.url is set.

groups:
  - name: recording.rules
    interval: 30s
    rules:
      - record: service:http_5xx_rate_percent:5m
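        # Note: clamp_min(..., 1) floors the denominator at 1 request/s, so the
        # percentage is underreported for services below that request rate.
        expr: |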
          100 *
          sum by (env, service) (rate(http_requests_total{status=~"5.."}[5m]))
          /
          clamp_min(sum by (env, service) (rate(http_requests_total[5m])), 1)

Then alert on the recorded metric:

- alert: P1ProdApiHigh5xxRate
  expr: service:http_5xx_rate_percent:5m{env="prod", service="api"} >= 5
  for: 5m
  labels:
    severity: P1
    env: prod
    service: api
    team: platform
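The alerting rules in this document combine a for: 5m clause with a 30s evaluation interval. The following is a simplified Python model of how a for clause keeps an alert pending until the condition has held long enough. It is a sketch of the Prometheus-style semantics, not vmalert's actual code; the function name and the inactive/pending/firing bookkeeping are illustrative assumptions.

```python
def alert_states(condition_results, interval_s, for_s):
    """Return one state per evaluation: 'inactive', 'pending' or 'firing'.

    condition_results: iterable of booleans, one per evaluation.
    interval_s: seconds between evaluations.
    for_s: the rule's `for` duration in seconds.
    """
    states = []
    active_since = None  # time at which the condition first became true
    for i, matched in enumerate(condition_results):
        now = i * interval_s
        if not matched:
            active_since = None  # any miss resets the pending timer
            states.append("inactive")
            continue
        if active_since is None:
            active_since = now
        held_for = now - active_since
        states.append("firing" if held_for >= for_s else "pending")
    return states
```

With interval_s=30 and for_s=300, the condition has to match on eleven consecutive evaluations (0s through 300s) before the state becomes firing; a single miss in between restarts the count.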

5. Cloud Metrics Pattern

For AWS CloudWatch metrics exported by YACE and scraped into VictoriaMetrics:

CloudWatch
  -> YACE
  -> vmagent / Prometheus scrape
  -> VictoriaMetrics
  -> vmalert
  -> Alertmanager

Rules of thumb:

CloudWatch scrape interval:
    usually 300s

vmalert evaluation interval:
    can be 30s/60s, but cloud metrics still refresh only every 300s

for duration:
    usually >= 2 * cloud scrape interval (for example, 10m for a 300s scrape)
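The interval guidance above is simple arithmetic, sketched here in Python. The helper names and the default multiplier of 2 are assumptions taken from the rule of thumb, not vmalert settings.

```python
import math


def min_for_seconds(scrape_interval_s: int, multiplier: int = 2) -> int:
    """Smallest recommended `for` duration for a cloud metric rule."""
    return multiplier * scrape_interval_s


def evaluations_per_sample(scrape_interval_s: int, eval_interval_s: int) -> int:
    """How many evaluations fall within one scrape interval.

    These evaluations all see the same scraped value, so evaluating
    more often does not make the data any fresher.
    """
    return math.ceil(scrape_interval_s / eval_interval_s)
```

For a 300s scrape, min_for_seconds(300) is 600 (for: 10m), and evaluations_per_sample(300, 30) is 10, meaning ten evaluations in a row re-read one CloudWatch data point.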

Example: SQS oldest message age:

- alert: P1ProdSqsOldestMessageTooOld
  expr: aws_sqs_approximate_age_of_oldest_message_maximum{env="prod", queue_name="orders"} >= 300
  for: 10m
  labels:
    severity: P1
    env: prod
    service: order-worker
    team: platform
    resource_type: sqs-queue
    resource: orders
  annotations:
    summary: "SQS orders oldest message age is too high"
    runbook_url: "https://docs.example.com/runbooks/sqs-backlog"

6. Verify

Check vmalert's own metrics, loaded rules, and active alerts:

curl -s http://vmalert:8880/metrics | grep 'vmalert_'
curl -s http://vmalert:8880/api/v1/rules
curl -s http://vmalert:8880/api/v1/alerts

Check that Alertmanager received the alerts:

curl -s http://alertmanager:9093/api/v2/alerts
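The rules check can also be scripted. The Python sketch below assumes that /api/v1/rules returns the Prometheus-compatible shape (data.groups[].rules[] with name, health and lastError fields); verify the field names against your vmalert version. The helper names are hypothetical.

```python
import json
import urllib.request


def fetch_json(url: str, timeout_s: float = 5.0) -> dict:
    """GET a URL and decode the JSON response body."""
    with urllib.request.urlopen(url, timeout=timeout_s) as resp:
        return json.load(resp)


def unhealthy_rules(payload: dict) -> list:
    """Return (group, rule, last_error) for every rule whose health is not 'ok'.

    Assumption: a rule without a 'health' field is treated as healthy.
    """
    problems = []
    for group in payload.get("data", {}).get("groups", []):
        for rule in group.get("rules", []):
            if rule.get("health", "ok") != "ok":
                problems.append(
                    (group.get("name", ""), rule.get("name", ""), rule.get("lastError", ""))
                )
    return problems
```

Usage: pass the result of fetch_json("http://vmalert:8880/api/v1/rules") to unhealthy_rules and fail the check when the returned list is not empty.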

7. Production Checklist

rules:
    labels match the alerting standard
    runbook/dashboard annotations exist
    recording rules used for expensive expressions
    cloud metric rules account for scrape delay

vmalert:
    datasource reachable
    notifier reachable
    rule files mounted read-only
    evaluation interval documented
    vmalert's own metrics scraped

operations:
    rule validation in CI
    Alertmanager HA if alerts are critical
    vmalert logs monitored
    alert on failed rule evaluations exists
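The "labels match the alerting standard" and "rule validation in CI" items could be enforced with a small check like the Python sketch below. The required label and annotation sets are assumptions inferred from the examples in this document; adjust them to your actual standard. The functions work on already-parsed rule files (plain dicts), so any YAML loader can feed them.

```python
# Assumptions: inferred from the example rules, not an official standard.
REQUIRED_LABELS = ("severity", "env", "service", "team")
REQUIRED_ANNOTATIONS = ("summary", "runbook_url")


def validate_rule(rule: dict) -> list:
    """Return a list of problems for one rule; an empty list means it passes."""
    if "alert" not in rule:
        return []  # recording rules are not checked here
    name = rule["alert"]
    labels = rule.get("labels") or {}
    annotations = rule.get("annotations") or {}
    problems = []
    for key in REQUIRED_LABELS:
        if not labels.get(key):
            problems.append(f"{name}: missing label '{key}'")
    for key in REQUIRED_ANNOTATIONS:
        if not annotations.get(key):
            problems.append(f"{name}: missing annotation '{key}'")
    return problems


def validate_groups(doc: dict) -> list:
    """Validate every rule in a parsed rule file (top-level 'groups' key)."""
    problems = []
    for group in doc.get("groups") or []:
        for rule in group.get("rules") or []:
            problems.extend(validate_rule(rule))
    return problems
```

In CI, each rule file would be parsed (for example with a third-party YAML library) and the job failed when validate_groups returns any problems. This complements, rather than replaces, syntax validation such as running vmalert with -dryRun against the same files.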