Alertmanager#
Alertmanager receives the alerts sent by Prometheus or vmalert, then handles grouping, inhibition, silencing, routing, and notification.
1. Core Concepts#
| Concept | Meaning |
|---|---|
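| route | Routing tree that decides where each alert is sent |
| receiver | Notification destination, such as Slack, Email, Webhook, or PagerDuty |
| group_by | Labels used to group alerts into a single notification |
| group_wait | How long to wait after the first alert of a new group before sending the first notification |
| group_interval | How long to wait before notifying about new alerts added to a group that has already been notified |
| repeat_interval | How long to wait before re-sending a notification for alerts that are still firing |
| inhibit_rules | Suppress lower-severity alerts while a matching higher-severity alert is firing |
| silence | Manually mute matching alerts for a period of time |
| template | Template for the notification content |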
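The three timing settings are easy to confuse, so here is a rough timeline for a single group, written as comments on a route fragment. It assumes the same values as the basic config in section 2 (30s, 5m, 4h), an assumed receiver name, and a hypothetical sequence of alerts. Treat the timestamps as approximate rather than exact.

```yaml
route:
  receiver: default        # assumed receiver name, for illustration only
  group_by: [alertname, service, env]

  # t=0s     first alert of a new group arrives; nothing is sent yet
  # t=30s    group_wait elapses and the first notification goes out
  group_wait: 30s

  # t=2m     a second alert joins the same group
  # t=5m30s  group_interval elapses and an updated notification is sent
  group_interval: 5m

  # No further changes: the still-firing group is notified again
  # roughly 4h after the last notification.
  repeat_interval: 4h
```

If the alerts for one incident tend to arrive within a minute of each other, a longer group_wait collects more of them into the first message, at the cost of a slower first notification.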
2. Basic Config#

    global:
      resolve_timeout: 5m

    route:
      # Fallback receiver for alerts that match no child route
      receiver: default
      group_by:
        - alertname
        - service
        - env
      group_wait: 30s
      group_interval: 5m
      repeat_interval: 4h
      routes:
        # Critical alerts go to a dedicated receiver
        - receiver: critical
          matchers:
            - severity="critical"

    receivers:
      - name: default
        webhook_configs:
          - url: http://localhost:5001/alertmanager
            send_resolved: true
      - name: critical
        webhook_configs:
          - url: http://localhost:5001/critical
            send_resolved: true

3. Inhibition#
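When a critical alert is already firing for a service, warning alerts that share the same alertname, service, and env can be inhibited to cut duplicate noise.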

    inhibit_rules:
      - source_matchers:
          - severity="critical"
        target_matchers:
          - severity="warning"
        # Inhibit only when these labels have the same value on both alerts
        equal:
          - alertname
          - service
          - env

4. Alert Labels Standard#

    required labels:
      alertname
      severity
      service
      env

    recommended labels:
      team
      region
      cluster
      namespace

Example (alertname comes from the rule's alert field, so it is not repeated under labels):

    labels:
      severity: critical
      service: order-api
      env: prod
      team: platform

5. Alert Annotations Standard#

    summary:
      One-sentence statement of the problem
    description:
      Impact scope, current value, threshold, and duration
    runbook_url:
      Link to the troubleshooting runbook
    dashboard_url:
      Link to the Grafana dashboard

Example:

    annotations:
      summary: "High HTTP error rate"
      description: "order-api 5xx error rate is higher than 5% for 5 minutes."
      runbook_url: "https://wiki.example.com/runbooks/order-api-5xx"
      dashboard_url: "https://grafana.example.com/d/order-api"

6. Common Notification Template#
This template works as a reusable base when adapting notifications for webhook, Slack, Feishu, or DingTalk.

    {{ define "alert.title" -}}
    [{{ .Status | toUpper }}] {{ .CommonLabels.alertname }}
    {{- end }}

    {{ define "alert.text" -}}
    {{ range .Alerts }}
    Alert: {{ .Labels.alertname }}
    Status: {{ .Status }}
    Severity: {{ .Labels.severity }}
    Service: {{ .Labels.service }}
    Env: {{ .Labels.env }}
    Summary: {{ .Annotations.summary }}
    Description: {{ .Annotations.description }}
    {{ if .Annotations.dashboard_url }}Dashboard: {{ .Annotations.dashboard_url }}{{ end }}
    {{ if .Annotations.runbook_url }}Runbook: {{ .Annotations.runbook_url }}{{ end }}
    StartsAt: {{ .StartsAt }}
    {{ if eq .Status "resolved" }}EndsAt: {{ .EndsAt }}{{ end }}
    {{ end }}
    {{- end }}

Alertmanager config:

    templates:
      - /etc/alertmanager/templates/*.tmpl

7. Best Practices#
Config:
- Keep alertmanager.yml under Git version control
- Name receivers by team, channel, or purpose
- Use separate routes for critical and warning alerts

Routing:
- Keep a default receiver as the catch-all
- Route by severity, team, or service
- Send high-priority alerts to phone, PagerDuty, or Opsgenie
- Send ordinary alerts to Slack, Email, or IM
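A minimal sketch of a routing tree that follows these points. The receiver names, the team label value, the channel, and the secret file paths are all hypothetical, and the `*_file` options are only available in more recent Alertmanager releases, so check them against the version in use.

```yaml
route:
  receiver: default                # catch-all for anything unmatched
  group_by: [alertname, service, env]
  routes:
    # High-priority alerts page the on-call engineer
    - receiver: oncall-pagerduty
      matchers:
        - severity="critical"
      continue: true               # keep evaluating the sibling routes below
    # Alerts owned by the platform team go to its chat channel
    - receiver: platform-slack
      matchers:
        - team="platform"

receivers:
  - name: default
    webhook_configs:
      - url: http://localhost:5001/alertmanager
  - name: oncall-pagerduty
    pagerduty_configs:
      # Hypothetical path; the key is read from a file instead of being inlined
      - routing_key_file: /etc/alertmanager/secrets/pagerduty-routing-key
  - name: platform-slack
    slack_configs:
      - channel: "#platform-alerts"
        # Hypothetical path to a file holding the Slack webhook URL
        api_url_file: /etc/alertmanager/secrets/slack-webhook-url
```

Without `continue: true`, matching stops at the first child route that matches, so a critical platform alert would page on-call but never reach the Slack channel.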
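
Noise Control:
- Choose group_by labels deliberately
- Use inhibition to cut duplicate notifications
- Use silences for maintenance windows
- Avoid sending a separate message for every instance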
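How group_by affects message volume can be sketched with a few alternatives for the same route. The receiver name is an assumption; the label names follow the basic config in section 2, and instance stands for whatever per-target label the alerts carry.

```yaml
route:
  receiver: default   # assumed receiver name
  # Alerts that share alertname, service, and env land in one notification,
  # even when they come from many instances.
  group_by: [alertname, service, env]

  # Adding instance creates one group, and so one message, per instance:
  # group_by: [alertname, service, env, instance]

  # The special value '...' groups by every label, which effectively
  # turns aggregation off:
  # group_by: ['...']
```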

Template:
- Keep the title short
- Include service, env, severity, summary, runbook, and dashboard in the body
- Make both firing and resolved notifications readable
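As a sketch of how a receiver could use the templates from section 6, the fragment below wires "alert.title" and "alert.text" into a Slack receiver. The receiver name, channel, and secret file path are hypothetical.

```yaml
templates:
  - /etc/alertmanager/templates/*.tmpl

receivers:
  - name: platform-slack
    slack_configs:
      - channel: "#platform-alerts"
        # Read the webhook URL from a file so it stays out of Git
        api_url_file: /etc/alertmanager/secrets/slack-webhook-url
        send_resolved: true        # deliver resolved notifications as well
        title: '{{ template "alert.title" . }}'
        text: '{{ template "alert.text" . }}'
```

Note that webhook_configs send a fixed JSON payload, so these templates apply to receivers with templated fields, such as Slack or email. For a plain webhook, the relay service has to render its own text.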

Security:
- Do not commit webhook URLs or tokens to Git
- Manage sensitive values through a Secret, environment variables, or an external secret store

8. References#
- Alertmanager configuration: https://prometheus.io/docs/alerting/latest/configuration/
- Notification templates: https://prometheus.io/docs/alerting/latest/notifications/
- Prometheus alerting overview: https://prometheus.io/docs/alerting/latest/overview/