Links
https://prometheus.io/docs/practices/alerting/
https://prometheus.io/docs/prometheus/latest/configuration/alerting_rules/
https://prometheus.io/docs/prometheus/latest/configuration/configuration/#alertmanager_config
https://prometheus.io/docs/alerting/latest/alertmanager/

1. Important Points
Prometheus evaluates alerting rules and sends the resulting alerts to Alertmanager. Prometheus decides when an alert is firing or resolved; Alertmanager handles grouping, routing, silencing, inhibition, and notifications.
Prometheus:
  evaluates alert expressions
  applies the "for" duration
  attaches labels and annotations
  sends alerts to Alertmanager
Alertmanager:
  groups alerts
  routes by labels
  deduplicates notifications
  handles silences and inhibition
  sends notifications

Use Prometheus alerts for:
  service metrics
  host metrics
  blackbox probes
  application SLO symptoms
  cloud metrics scraped by exporters

2. Rule Standard
Rule file:
groups:
  - name: api.rules
    rules:
      - alert: P1ProdApiHigh5xxRate
        expr: |
          100 *
            sum(rate(http_requests_total{env="prod", service="api", status=~"5.."}[5m]))
          /
            clamp_min(sum(rate(http_requests_total{env="prod", service="api"}[5m])), 1)
          >= 5
        for: 5m
        labels:
          severity: P1
          env: prod
          service: api
          team: platform
          resource_type: service
          resource: api
        annotations:
          summary: "API 5xx rate is high"
          description: "API 5xx rate is {{ $value | printf \"%.2f\" }}% for 5m."
          dashboard_url: "https://grafana.example.com/d/api"
          runbook_url: "https://docs.example.com/runbooks/api-5xx"

Rules:
labels:
  used for routing and grouping
  should be stable and low-cardinality
annotations:
  used for human context
  can include dynamic values
  should include dashboard_url and runbook_url for P0/P1

3. Prometheus Config
Prometheus sends alerts to Alertmanager:
alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - alertmanager-1:9093
            - alertmanager-2:9093
rule_files:
  - /etc/prometheus/rules/*.yaml

Validate rules:
promtool check rules /etc/prometheus/rules/*.yaml

Reload (requires Prometheus to be started with --web.enable-lifecycle):
curl -X POST http://prometheus:9090/-/reload

4. Alertmanager Routing
Route by severity, env, and team.
route:
  receiver: default
  group_by:
    - env
    - service
    - alertname
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  routes:
    - matchers:
        - severity=~"P0|P1"
        - env="prod"
      receiver: pagerduty-prod
      continue: true
    - matchers:
        - team="platform"
      receiver: platform-slack
receivers:
  - name: default
  - name: platform-slack
    slack_configs:
      - api_url_file: /etc/alertmanager/secrets/slack-webhook
        channel: "#platform-alerts"
        send_resolved: true
        title: '{{ .CommonLabels.severity }} {{ .CommonLabels.env }} {{ .CommonLabels.service }}'
        text: '{{ range .Alerts }}{{ .Annotations.summary }} {{ .Annotations.runbook_url }}{{ "\n" }}{{ end }}'
  - name: pagerduty-prod
    pagerduty_configs:
      - routing_key_file: /etc/alertmanager/secrets/pagerduty-routing-key
        send_resolved: true

5. Inhibition
Use inhibition to suppress lower-severity alerts that are symptoms of an already-firing root-cause alert.
inhibit_rules:
  - source_matchers:
      - alertname="P0ProdApiDown"
    target_matchers:
      - severity=~"P1|P2"
      - service="api"
    equal:
      - env
      - service

Pattern:
when:
  service-down fires
mute:
  latency high
  5xx high
  exporter down for same service

6. Common Alerts
Instance down:
- alert: P1ProdInstanceDown
  expr: up{env="prod"} == 0
  for: 3m
  labels:
    severity: P1
    env: prod
    team: platform
    resource_type: instance
  annotations:
    summary: "Prometheus target is down: {{ $labels.job }} {{ $labels.instance }}"
    runbook_url: "https://docs.example.com/runbooks/target-down"

Blackbox probe failed:
- alert: P0ProdApiBlackboxFailed
  expr: probe_success{env="prod", service="api"} == 0
  for: 3m
  labels:
    severity: P0
    env: prod
    service: api
    team: platform
    resource_type: endpoint
  annotations:
    summary: "API blackbox probe failed"
    dashboard_url: "https://grafana.example.com/d/blackbox"
    runbook_url: "https://docs.example.com/runbooks/api-down"

7. Production Checklist
rules:
  promtool check rules in CI
  "for" duration set for every paging alert
  labels follow standard
  annotations include runbook/dashboard
alertmanager:
  HA pair configured
  route by severity/env/team
  send_resolved enabled for P0/P1
  secrets mounted from file or secret store
  silences audited
quality:
  every page is actionable
  low-cardinality labels only
  inhibition prevents alert storms
  noisy alerts reviewed regularly
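
Rule unit test (sketch):
promtool check rules only validates syntax. The checklist items on "for" duration and standard labels can also be checked with promtool test rules, which replays synthetic series against a rule file. The file below is a minimal sketch, not part of the original setup: it assumes the P1ProdApiHigh5xxRate rule from section 2 is saved as api.rules.yaml next to the test file (both file names are hypothetical), and the input values are made up to give a steady 10% error ratio (90 req/s of 200s, 10 req/s of 500s). Under those assumptions the alert should be firing at the 15-minute mark.

```yaml
# api.rules_test.yaml (hypothetical name)
rule_files:
  - api.rules.yaml          # assumed location of the rule from section 2

evaluation_interval: 1m

tests:
  - interval: 1m            # spacing between the synthetic samples below
    input_series:
      # Counter grows by 5400 per minute, i.e. 90 successful requests/s.
      - series: 'http_requests_total{env="prod", service="api", status="200"}'
        values: '0+5400x15'
      # Counter grows by 600 per minute, i.e. 10 failed requests/s (10% of total).
      - series: 'http_requests_total{env="prod", service="api", status="500"}'
        values: '0+600x15'
    alert_rule_test:
      # 15m is well past the 5m rate window plus the 5m "for" duration.
      - eval_time: 15m
        alertname: P1ProdApiHigh5xxRate
        exp_alerts:
          - exp_labels:
              severity: P1
              env: prod
              service: api
              team: platform
              resource_type: service
              resource: api
            exp_annotations:
              summary: "API 5xx rate is high"
              description: "API 5xx rate is 10.00% for 5m."
              dashboard_url: "https://grafana.example.com/d/api"
              runbook_url: "https://docs.example.com/runbooks/api-5xx"
```

Run in CI alongside the syntax check:
promtool test rules api.rules_test.yaml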