Alertmanager

Alertmanager receives the alerts sent by Prometheus / vmalert, then handles grouping, inhibition, silencing, routing, and notification.

1. Core Concepts

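Concept          Meaning
route            Routing tree that decides which receiver each alert is sent to
receiver         Notification target, such as Slack, Email, Webhook, or PagerDuty
group_by         Labels used to aggregate alerts into one notification group
group_wait       How long to wait after the first alert of a new group fires before sending the first notification
group_interval   How long to wait before notifying about new alerts added to a group that was already notified
repeat_interval  How long to wait before re-sending a notification for alerts that are still firing
inhibit_rules    Suppress lower-severity alerts while a matching higher-severity alert is firing
silence          Manually mute alerts that match a set of matchers
template         Template for the notification content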
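
To make group_by concrete, here is a minimal Python sketch of how alerts could be bucketed by label values. It only illustrates the idea and is not Alertmanager's implementation; the function names and the sample alerts are hypothetical, and the label set (alertname, service, env) is assumed for illustration.

```python
from collections import defaultdict

# Assumed group_by labels, chosen for illustration only.
GROUP_BY = ("alertname", "service", "env")


def group_key(labels, group_by=GROUP_BY):
    """Return the tuple of label values that identifies an alert's group."""
    # A missing label is treated as an empty string so the alert still lands in a group.
    return tuple(labels.get(name, "") for name in group_by)


def group_alerts(alerts, group_by=GROUP_BY):
    """Bucket alerts (dicts with a 'labels' key) by their group key."""
    groups = defaultdict(list)
    for alert in alerts:
        groups[group_key(alert["labels"], group_by)].append(alert)
    return dict(groups)


# Hypothetical sample alerts: two instances of one service, one of another.
alerts = [
    {"labels": {"alertname": "HighErrorRate", "service": "order-api", "env": "prod", "instance": "a"}},
    {"labels": {"alertname": "HighErrorRate", "service": "order-api", "env": "prod", "instance": "b"}},
    {"labels": {"alertname": "HighErrorRate", "service": "pay-api", "env": "prod", "instance": "c"}},
]
groups = group_alerts(alerts)
for key, members in groups.items():
    print(key, len(members))
```

Because instance is not part of the group key, the two order-api alerts share one group and would be delivered in a single notification.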

2. Basic Config

global:
  resolve_timeout: 5m

route:
  receiver: default
  group_by:
    - alertname
    - service
    - env
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  routes:
    - receiver: critical
      matchers:
        - severity="critical"

receivers:
  - name: default
    webhook_configs:
      - url: http://localhost:5001/alertmanager
        send_resolved: true

  - name: critical
    webhook_configs:
      - url: http://localhost:5001/critical
        send_resolved: true
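
The webhook URLs above need something listening on port 5001. The following is a minimal sketch of such a receiver using only the Python standard library. It assumes the JSON body follows the Alertmanager webhook format (top-level `alerts` list, each entry with `status` and `labels`); `summarize`, `AlertHandler`, and `serve` are hypothetical names, and the sample payload is made up.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def summarize(payload):
    """Reduce a webhook payload to one short line per alert."""
    lines = []
    for alert in payload.get("alerts", []):
        labels = alert.get("labels", {})
        lines.append(
            f"[{alert.get('status', 'unknown').upper()}] "
            f"{labels.get('alertname', '?')} "
            f"severity={labels.get('severity', '?')} "
            f"service={labels.get('service', '?')}"
        )
    return lines


class AlertHandler(BaseHTTPRequestHandler):
    """Accepts POSTs on any path, e.g. /alertmanager or /critical."""

    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        payload = json.loads(self.rfile.read(length) or b"{}")
        for line in summarize(payload):
            print(self.path, line)
        # Any 2xx response tells the sender the notification was accepted.
        self.send_response(200)
        self.end_headers()


def serve(port=5001):
    """Start the receiver; blocks until interrupted."""
    HTTPServer(("localhost", port), AlertHandler).serve_forever()


# Hypothetical payload used to exercise summarize() without starting the server.
sample = {
    "status": "firing",
    "alerts": [
        {
            "status": "firing",
            "labels": {"alertname": "HighErrorRate", "severity": "critical", "service": "order-api"},
        }
    ],
}
print(summarize(sample))
```

Calling `serve()` starts the listener. Parsing is kept in a separate function so it can be checked without opening a socket.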

3. Inhibition

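When a critical alert is already firing for a service, the warning alert with the same alertname, service, and env can be inhibited to reduce duplicate noise.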

inhibit_rules:
  - source_matchers:
      - severity="critical"
    target_matchers:
      - severity="warning"
    equal:
      - alertname
      - service
      - env
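
The matching logic of this rule can be sketched in Python as follows. This is a simplified model for reasoning about the rule, not Alertmanager's code: it handles only equality matchers, and the function names and sample alerts are hypothetical.

```python
# Mirrors the inhibit rule above, restricted to simple equality matchers.
SOURCE_MATCHERS = {"severity": "critical"}
TARGET_MATCHERS = {"severity": "warning"}
EQUAL = ("alertname", "service", "env")


def matches(labels, matchers):
    """True when every matcher label has exactly the expected value."""
    return all(labels.get(name) == value for name, value in matchers.items())


def is_inhibited(target, firing_alerts):
    """True when some firing source alert suppresses the target alert."""
    if not matches(target, TARGET_MATCHERS):
        return False
    for source in firing_alerts:
        if not matches(source, SOURCE_MATCHERS):
            continue
        # A missing label and an empty label value are treated as the same.
        if all(source.get(name, "") == target.get(name, "") for name in EQUAL):
            return True
    return False


# Hypothetical label sets.
critical = {"alertname": "HighErrorRate", "severity": "critical", "service": "order-api", "env": "prod"}
warning_same = {"alertname": "HighErrorRate", "severity": "warning", "service": "order-api", "env": "prod"}
warning_other = {"alertname": "HighErrorRate", "severity": "warning", "service": "pay-api", "env": "prod"}

print(is_inhibited(warning_same, [critical]))
print(is_inhibited(warning_other, [critical]))
```

Only the warning that shares all three `equal` labels with the critical alert is suppressed. The pay-api warning still notifies.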

4. Alert Labels Standard

required labels:
    alertname
    severity
    service
    env

recommended labels:
    team
    region
    cluster
    namespace

Example:

labels:
  severity: critical
  service: order-api
  env: prod
  team: platform
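
A label standard is easier to keep when it is checked automatically, for example in CI against rule files. Below is a minimal sketch of such a check. `check_labels` is a hypothetical helper, and the sample assumes the final label set of an alert, where alertname has already been added from the rule name (the value `HighErrorRate` is made up).

```python
REQUIRED = ("alertname", "severity", "service", "env")
RECOMMENDED = ("team", "region", "cluster", "namespace")


def check_labels(labels):
    """Return (missing_required, missing_recommended) for one alert's labels."""
    # Empty values count as missing, since they carry no routing information.
    missing_required = [name for name in REQUIRED if not labels.get(name)]
    missing_recommended = [name for name in RECOMMENDED if not labels.get(name)]
    return missing_required, missing_recommended


# Hypothetical final label set of a firing alert.
labels = {
    "alertname": "HighErrorRate",
    "severity": "critical",
    "service": "order-api",
    "env": "prod",
    "team": "platform",
}
missing_required, missing_recommended = check_labels(labels)
print("missing required:", missing_required)
print("missing recommended:", missing_recommended)
```

One possible policy is to fail the check on any missing required label and only warn on missing recommended ones.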

5. Alert Annotations Standard

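summary:
    One-sentence statement of the problem

description:
    Impact scope, current value, threshold, and duration

runbook_url:
    Link to the troubleshooting runbook

dashboard_url:
    Link to the Grafana dashboard

Example:

annotations: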
  summary: "High HTTP error rate"
  description: "order-api 5xx error rate is higher than 5% for 5 minutes."
  runbook_url: "https://wiki.example.com/runbooks/order-api-5xx"
  dashboard_url: "https://grafana.example.com/d/order-api"

6. Common Notification Template

This template is a reusable starting point when adapting notifications for webhook / Slack / Feishu / DingTalk.

{{ define "alert.title" -}}
[{{ .Status | toUpper }}] {{ .CommonLabels.alertname }}
{{- end }}

{{ define "alert.text" -}}
{{ range .Alerts }}
Alert: {{ .Labels.alertname }}
Status: {{ .Status }}
Severity: {{ .Labels.severity }}
Service: {{ .Labels.service }}
Env: {{ .Labels.env }}
Summary: {{ .Annotations.summary }}
Description: {{ .Annotations.description }}
{{ if .Annotations.dashboard_url }}Dashboard: {{ .Annotations.dashboard_url }}{{ end }}
{{ if .Annotations.runbook_url }}Runbook: {{ .Annotations.runbook_url }}{{ end }}
StartsAt: {{ .StartsAt }}
{{ if eq .Status "resolved" }}EndsAt: {{ .EndsAt }}{{ end }}

{{ end }}
{{- end }}

Alertmanager config:

templates:
  - /etc/alertmanager/templates/*.tmpl
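
When the formatting is done in a webhook bridge rather than in Alertmanager itself, the same layout can be rebuilt from the webhook JSON. The sketch below is a rough Python equivalent of the two templates above, not a port of the Go template engine. It assumes the webhook payload fields `status`, `commonLabels`, and `alerts` (each with `labels`, `annotations`, `startsAt`, `endsAt`); the function names and the sample payload are hypothetical.

```python
def alert_title(payload):
    """Equivalent of the "alert.title" template."""
    name = payload.get("commonLabels", {}).get("alertname", "")
    return f"[{payload.get('status', '').upper()}] {name}"


def alert_text(payload):
    """Equivalent of the "alert.text" template: one block per alert."""
    blocks = []
    for alert in payload.get("alerts", []):
        labels = alert.get("labels", {})
        ann = alert.get("annotations", {})
        lines = [
            f"Alert: {labels.get('alertname', '')}",
            f"Status: {alert.get('status', '')}",
            f"Severity: {labels.get('severity', '')}",
            f"Service: {labels.get('service', '')}",
            f"Env: {labels.get('env', '')}",
            f"Summary: {ann.get('summary', '')}",
            f"Description: {ann.get('description', '')}",
        ]
        # Optional links are printed only when the annotation is set.
        if ann.get("dashboard_url"):
            lines.append(f"Dashboard: {ann['dashboard_url']}")
        if ann.get("runbook_url"):
            lines.append(f"Runbook: {ann['runbook_url']}")
        lines.append(f"StartsAt: {alert.get('startsAt', '')}")
        # EndsAt is only meaningful once the alert has resolved.
        if alert.get("status") == "resolved":
            lines.append(f"EndsAt: {alert.get('endsAt', '')}")
        blocks.append("\n".join(lines))
    return "\n\n".join(blocks)


# Hypothetical payload with a single firing alert.
payload = {
    "status": "firing",
    "commonLabels": {"alertname": "HighErrorRate"},
    "alerts": [
        {
            "status": "firing",
            "labels": {"alertname": "HighErrorRate", "severity": "critical", "service": "order-api", "env": "prod"},
            "annotations": {
                "summary": "High HTTP error rate",
                "runbook_url": "https://wiki.example.com/runbooks/order-api-5xx",
            },
            "startsAt": "2024-01-01T00:00:00Z",
        }
    ],
}
title = alert_title(payload)
text = alert_text(payload)
print(title)
print(text)
```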

7. Best Practices

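Config:
    Keep alertmanager.yml under Git version control
    Name receivers after team / channel / purpose
    Use separate routes for critical and warning

Routing:
    Keep a default receiver as the catch-all
    Dispatch by severity / team / service
    Send high-priority alerts to phone / PagerDuty / Opsgenie
    Send ordinary alerts to Slack / Email / IM

Noise Control:
    Set group_by sensibly
    Use inhibition to reduce duplicate notifications
    Use silences for maintenance windows
    Avoid sending a separate message for every instance

Template:
    Keep the title short
    Include service / env / severity / summary / runbook / dashboard in the body
    Make both firing and resolved notifications readable

Security:
    Do not commit webhook URLs / tokens to Git
    Manage sensitive values with Secret / env / external secret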
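
One way to keep the config in Git without the secret is to commit a template and fill in the webhook URL at deploy time. The sketch below does this with Python's `string.Template`. The variable name `ALERT_WEBHOOK_URL`, the function `render_config`, and the example URL are all assumptions for illustration, and the rendering step runs outside Alertmanager.

```python
import os
from string import Template

# Committed to Git: contains a placeholder instead of the real webhook URL.
CONFIG_TEMPLATE = Template("""\
receivers:
  - name: default
    webhook_configs:
      - url: ${ALERT_WEBHOOK_URL}
        send_resolved: true
""")


def render_config(env=os.environ):
    """Fill the placeholder from the environment.

    Template.substitute raises KeyError when the variable is missing,
    so a deploy without the secret fails instead of writing a broken config.
    """
    return CONFIG_TEMPLATE.substitute(env)


# Hypothetical value; in a real deploy it would come from the secret store.
rendered = render_config({"ALERT_WEBHOOK_URL": "https://hooks.example.com/T000/secret"})
print(rendered)
```

The rendered output would be written to the path Alertmanager reads, and that generated file should stay out of the repository.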

8. References