Links#
https://docs.victoriametrics.com/vmalert/
https://docs.victoriametrics.com/vmalert/#alerting-rules
https://docs.victoriametrics.com/vmalert/#notifier
https://docs.victoriametrics.com/vmalert/#recording-rules
https://prometheus.io/docs/alerting/latest/alertmanager/

1. Important Points#
vmalert evaluates alerting and recording rules against VictoriaMetrics or another MetricsQL/PromQL-compatible datasource. It commonly sends alerts to Alertmanager.
vmalert is used for:
alerting rules
recording rules
MetricsQL expressions
alerts from VictoriaMetrics data
long-range / cloud metrics rules when data is in VictoriaMetrics
vmalert is not responsible for:
final notification routing
silence management by itself
incident workflow

Typical path:
vmagent / Prometheus scrape
-> VictoriaMetrics
-> vmalert
-> Alertmanager
-> notification channel

2. Runtime Config#
Basic command:
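# -remoteWrite.url persists recording rule results and alert state;
# -remoteRead.url restores alert state after a restart.
vmalert \
  -datasource.url=http://victoriametrics:8428 \
  -remoteWrite.url=http://victoriametrics:8428 \
  -remoteRead.url=http://victoriametrics:8428 \
  -notifier.url=http://alertmanager:9093 \
  -rule="/etc/vmalert/rules/*.yaml" \
  -external.url=https://vmalert.example.com \
  -evaluationInterval=30s

Docker Compose example: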
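services:
  vmalert:
    image: victoriametrics/vmalert:<version>
    command:
      - -datasource.url=http://victoriametrics:8428
      # Persist recording rule results and alert state; restore state on restart.
      - -remoteWrite.url=http://victoriametrics:8428
      - -remoteRead.url=http://victoriametrics:8428
      - -notifier.url=http://alertmanager:9093
      - -rule=/etc/vmalert/rules/*.yaml
      - -external.url=https://vmalert.example.com
      - -evaluationInterval=30s
    volumes:
      - ./rules:/etc/vmalert/rules:ro
    ports:
      - "8880:8880"

3. Rule Standard#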
vmalert rule format is Prometheus-compatible.
groups:
  - name: alb.rules
    interval: 30s
    rules:
      - alert: P1ProdAlbTarget5xxRateHigh
        # CloudWatch exporters such as YACE publish these statistics as gauges
        # (per-period sums), so the ratio uses the raw values instead of rate().
        expr: |
          100 *
            sum(aws_applicationelb_httpcode_target_5xx_count_sum{env="prod", service="api"})
          /
            clamp_min(sum(aws_applicationelb_request_count_sum{env="prod", service="api"}), 1)
          >= 5
        for: 5m
        labels:
          severity: P1
          env: prod
          service: api
          team: platform
          resource_type: targetgroup
          resource: prod-api
        annotations:
          summary: "ALB target 5xx rate is high"
          dashboard_url: "https://grafana.example.com/d/alb"
          runbook_url: "https://docs.example.com/runbooks/alb-target-5xx"

4. Recording Rules#
Use recording rules to precompute expensive expressions that are evaluated repeatedly. vmalert persists the results only when -remoteWrite.url is set.
groups:
  - name: recording.rules
    interval: 30s
    rules:
      - record: service:http_5xx_rate_percent:5m
        # clamp_min(..., 1) avoids division by zero, but it understates the
        # percentage when total traffic is below 1 request per second.
        expr: |
          100 *
            sum by (env, service) (rate(http_requests_total{status=~"5.."}[5m]))
          /
            clamp_min(sum by (env, service) (rate(http_requests_total[5m])), 1)

Then alert on the recorded metric:
- alert: P1ProdApiHigh5xxRate
  expr: service:http_5xx_rate_percent:5m{env="prod", service="api"} >= 5
  for: 5m
  labels:
    severity: P1
    env: prod
    service: api
    team: platform

5. Cloud Metrics Pattern#
For AWS CloudWatch metrics exported by YACE and scraped into VictoriaMetrics:
CloudWatch
-> YACE
-> vmagent / Prometheus scrape
-> VictoriaMetrics
-> vmalert
-> Alertmanager

Rules:
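CloudWatch scrape interval: usually 300s
vmalert evaluation interval: can be 30s or 60s, but cloud metrics still refresh only once per 300s scrape
for duration: usually >= 2 * cloud scrape interval, so that at least two fresh samples confirm the condition (for example, 10m for a 300s scrape interval)

Example: SQS oldest message age: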
- alert: P1ProdSqsOldestMessageTooOld
  expr: aws_sqs_approximate_age_of_oldest_message_maximum{env="prod", queue_name="orders"} >= 300
  for: 10m
  labels:
    severity: P1
    env: prod
    service: order-worker
    team: platform
    resource_type: sqs-queue
    resource: orders
  annotations:
    summary: "SQS orders oldest message age is too high"
    runbook_url: "https://docs.example.com/runbooks/sqs-backlog"

6. Verify#
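Before checking a running instance, the rule files themselves can be validated offline. The sketch below is one way to do that with Docker Compose, assuming vmalert's -dryRun flag, which checks the files given by -rule and exits without running vmalert. The service name vmalert-validate is hypothetical; the image tag and mount path mirror the Compose example in section 2.

```yaml
services:
  # Hypothetical one-off service: validates the rule files, then exits.
  vmalert-validate:
    image: victoriametrics/vmalert:<version>
    command:
      - -rule=/etc/vmalert/rules/*.yaml
      # -dryRun only checks the rule files; vmalert does not start evaluating.
      - -dryRun
    volumes:
      - ./rules:/etc/vmalert/rules:ro
```

Running it with `docker compose run --rm vmalert-validate` should fail with a parse error when a rule file is invalid, which makes it usable as a CI step.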
Check vmalert health, loaded rules, and active alerts:
curl -s http://vmalert:8880/metrics | grep 'vmalert_'
curl -s http://vmalert:8880/api/v1/rules
curl -s http://vmalert:8880/api/v1/alerts

Check that Alertmanager received the alerts:
curl -s http://alertmanager:9093/api/v2/alerts

7. Production Checklist#
rules:
  labels match the alerting standard
  runbook/dashboard annotations exist
  recording rules used for expensive expressions
  cloud metric rules account for scrape delay
vmalert:
  datasource reachable
  notifier reachable
  rule files mounted read-only
  evaluation interval documented
  vmalert's own metrics scraped
operations:
  rule validation in CI
  Alertmanager HA if alerts are critical
  vmalert logs monitored
  alert on failed rule evaluations exists
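
The checklist item about failed evaluations can be covered by alerting on vmalert's own metrics. The following is a minimal sketch, assuming the metric names exposed by recent vmalert releases (vmalert_alerting_rules_errors_total, vmalert_recording_rules_errors_total, vmalert_iteration_missed_total, vmalert_alerts_send_errors_total); confirm them against the /metrics output of the deployed version. The group name, alert names, P2 severity, and thresholds are illustrative assumptions, not part of the standard above.

```yaml
groups:
  - name: vmalert.self.rules        # hypothetical group name
    interval: 30s
    rules:
      # Any alerting or recording rule returned an error in the last 5 minutes.
      - alert: P2VmalertRuleEvaluationFailing
        expr: |
          (sum(increase(vmalert_alerting_rules_errors_total[5m])) > 0)
          or
          (sum(increase(vmalert_recording_rules_errors_total[5m])) > 0)
        for: 5m
        labels:
          severity: P2              # assumed severity level
          team: platform
        annotations:
          summary: "vmalert rule evaluations are failing"
      # A rule group took longer than its interval, so evaluations were skipped.
      - alert: P2VmalertIterationsMissed
        expr: sum(increase(vmalert_iteration_missed_total[5m])) > 0
        for: 10m
        labels:
          severity: P2
          team: platform
        annotations:
          summary: "vmalert is skipping rule group evaluations"
      # Alerts could not be delivered to the notifier (Alertmanager).
      - alert: P2VmalertNotifierSendErrors
        expr: sum(increase(vmalert_alerts_send_errors_total[5m])) > 0
        for: 5m
        labels:
          severity: P2
          team: platform
        annotations:
          summary: "vmalert cannot send alerts to the notifier"
```

If vmalert is the only evaluator, these rules cannot report a vmalert that is fully down, so the notifier-side check in section 6 is still worth keeping.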