Prometheus Alerting


https://prometheus.io/docs/practices/alerting/
https://prometheus.io/docs/prometheus/latest/configuration/alerting_rules/
https://prometheus.io/docs/prometheus/latest/configuration/configuration/#alertmanager_config
https://prometheus.io/docs/alerting/latest/alertmanager/

1. Important Points

Prometheus evaluates alerting rules and sends the resulting alerts to Alertmanager. Prometheus decides whether an alert is pending, firing or resolved; Alertmanager handles grouping, routing, silencing, inhibition and notifications.

Prometheus:
    evaluates alert expressions
    applies the "for" duration (pending, then firing)
    attaches labels and annotations
    sends alerts to Alertmanager

Alertmanager:
    groups alerts
    routes by labels
    deduplicates notifications
    applies silences and inhibition rules
    sends notifications

Use Prometheus alerts for:
    service metrics
    host metrics
    blackbox probes
    application SLO symptoms
    cloud metrics scraped by exporters

2. Rule Standard

Rule file:

groups:
  - name: api.rules
    rules:
      - alert: P1ProdApiHigh5xxRate
        expr: |
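          # clamp_min floors the denominator at 1 req/s, so the ratio is
          # understated whenever total traffic is below that rate.
          100 *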
          sum(rate(http_requests_total{env="prod", service="api", status=~"5.."}[5m]))
          /
          clamp_min(sum(rate(http_requests_total{env="prod", service="api"}[5m])), 1)
          >= 5
        for: 5m
        labels:
          severity: P1
          env: prod
          service: api
          team: platform
          resource_type: service
          resource: api
        annotations:
          summary: "API 5xx rate is high"
          description: "API 5xx rate is {{ $value | printf \"%.2f\" }}% for 5m."
          dashboard_url: "https://grafana.example.com/d/api"
          runbook_url: "https://docs.example.com/runbooks/api-5xx"

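Conventions:

labels:
    used for routing, grouping, silencing and inhibition
    should be stable and low-cardinality, because the label set is the alert's identity

annotations:
    used for human context
    can include dynamic values such as $value and $labels
    should include dashboard_url and runbook_url for P0/P1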
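
The split between labels and annotations can be sketched with a per-instance variant of the rule above. This is only an illustration: the alert name, the P2 severity and the 10m duration are assumptions, not part of the standard. Series labels that survive the aggregation (here instance) become alert labels automatically, while the changing value stays in the annotations so the alert's identity does not shift between evaluations.

```yaml
groups:
  - name: api.rules
    rules:
      # Hypothetical alert name; fires per instance instead of per service.
      - alert: P2ProdApiInstanceHigh5xxRate
        expr: |
          100 *
          sum by (instance) (rate(http_requests_total{env="prod", service="api", status=~"5.."}[5m]))
          /
          sum by (instance) (rate(http_requests_total{env="prod", service="api"}[5m]))
          >= 5
        for: 10m            # assumed duration
        labels:
          # Static values only: these feed routing and grouping.
          severity: P2
          env: prod
          service: api
          team: platform
          resource_type: instance
        annotations:
          # Dynamic values belong here, not in labels.
          summary: "API 5xx rate is high on {{ $labels.instance }}"
          description: "{{ $labels.instance }} 5xx rate is {{ $value | printf \"%.2f\" }}%."
```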

3. Prometheus Config

Prometheus sends each alert to every Alertmanager listed. For an HA pair, list both instances rather than a load-balanced address:

alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - alertmanager-1:9093
            - alertmanager-2:9093

rule_files:
  - /etc/prometheus/rules/*.yaml

Validate rules:

promtool check rules /etc/prometheus/rules/*.yaml
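
promtool check rules only validates syntax. Rule behavior can also be unit-tested with promtool test rules, which replays synthetic series against a rule file. The following is a minimal sketch for the 5xx rule from section 2; the file names api.rules.yaml and api.rules.test.yaml are assumptions, and the input values are chosen to give a steady 10% error ratio.

```yaml
# api.rules.test.yaml (hypothetical file name)
rule_files:
  - api.rules.yaml          # assumed name of the file holding the api.rules group

evaluation_interval: 1m

tests:
  - interval: 1m
    input_series:
      # Counter grows by 60 per minute: 1 req/s of 5xx responses.
      - series: 'http_requests_total{env="prod", service="api", status="500"}'
        values: '0+60x20'
      # Counter grows by 540 per minute: 9 req/s of 2xx responses.
      - series: 'http_requests_total{env="prod", service="api", status="200"}'
        values: '0+540x20'
    alert_rule_test:
      # Well past the 5m "for" duration, so the alert should be firing.
      - eval_time: 15m
        alertname: P1ProdApiHigh5xxRate
        exp_alerts:
          - exp_labels:
              severity: P1
              env: prod
              service: api
              team: platform
              resource_type: service
              resource: api
            exp_annotations:
              summary: "API 5xx rate is high"
              description: "API 5xx rate is 10.00% for 5m."
              dashboard_url: "https://grafana.example.com/d/api"
              runbook_url: "https://docs.example.com/runbooks/api-5xx"
```

Run it with: promtool test rules api.rules.test.yaml. Expected labels and annotations must match the rule exactly, so the test also catches accidental changes to the labeling standard.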

Reload (requires Prometheus to be started with --web.enable-lifecycle):

curl -X POST http://prometheus:9090/-/reload

4. Alertmanager Routing

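Route by severity, env and team. Sibling routes are checked top to bottom, and an alert stops at the first match unless that route sets continue: true. In the config below, a P0/P1 prod alert owned by platform therefore goes to PagerDuty and then to Slack.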

route:
  receiver: default
  group_by:
    - env
    - service
    - alertname
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  routes:
    - matchers:
        - severity=~"P0|P1"
        - env="prod"
      receiver: pagerduty-prod
      continue: true

    - matchers:
        - team="platform"
      receiver: platform-slack

receivers:
  - name: default   # no integrations: alerts routed here produce no notifications

  - name: platform-slack
    slack_configs:
      - api_url_file: /etc/alertmanager/secrets/slack-webhook
        channel: "#platform-alerts"
        send_resolved: true
        title: '{{ .CommonLabels.severity }} {{ .CommonLabels.env }} {{ .CommonLabels.service }}'
        text: '{{ range .Alerts }}{{ .Annotations.summary }} {{ .Annotations.runbook_url }}{{ "\n" }}{{ end }}'

  - name: pagerduty-prod
    pagerduty_configs:
      - routing_key_file: /etc/alertmanager/secrets/pagerduty-routing-key
        send_resolved: true
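
Routes can also be nested, which keeps per-environment policy in one place. The fragment below is a sketch of that layout rather than a recommended policy: the timing values are assumptions, and the receivers are the ones defined above. A child route inherits the receiver and timing of its parent unless it overrides them, and an alert is handled by the deepest route it matches.

```yaml
route:
  receiver: default
  group_by: [env, service, alertname]
  routes:
    - matchers:
        - env="prod"
      # Prod alerts that match no child route stay here and go to Slack.
      receiver: platform-slack
      routes:
        - matchers:
            - severity="P0"
          receiver: pagerduty-prod
          group_wait: 10s        # assumed: notify faster for P0
          repeat_interval: 1h    # assumed: re-notify more often for P0
    - matchers:
        - env!="prod"
      receiver: default
      repeat_interval: 24h       # assumed: non-prod repeats once a day
```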

5. Inhibition

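Use inhibition to suppress notifications for symptoms that a firing higher-priority alert already explains. The equal list restricts suppression to alerts that share the same values for those labels.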

inhibit_rules:
  - source_matchers:
      - alertname="P0ProdApiDown"
    target_matchers:
      - severity=~"P1|P2"
      - service="api"
    equal:
      - env
      - service

Pattern:

when:
    service-down fires

mute:
    latency high
    5xx high
    exporter down for same service
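
As a sketch of this pattern using the blackbox alert from section 6 as the source: while P0ProdApiBlackboxFailed fires, symptom alerts for the same env and service are muted. P1ProdApiHighLatency is a hypothetical alert name used only for illustration. Both source and target alerts need to carry the env and service labels for the equal check to be meaningful.

```yaml
inhibit_rules:
  - source_matchers:
      - alertname="P0ProdApiBlackboxFailed"
    target_matchers:
      # P1ProdApiHighLatency is hypothetical; list the symptom alerts you actually have.
      - alertname=~"P1ProdApiHigh5xxRate|P1ProdApiHighLatency"
    # Only mute targets whose env and service match the source alert.
    equal:
      - env
      - service
```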

6. Common Alerts

Instance down:

- alert: P1ProdInstanceDown
  expr: up{env="prod"} == 0
  for: 3m
  labels:
    severity: P1
    env: prod
    team: platform
    resource_type: instance
  annotations:
    summary: "Prometheus target is down: {{ $labels.job }} {{ $labels.instance }}"
    runbook_url: "https://docs.example.com/runbooks/target-down"
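
up == 0 only fires for targets Prometheus still knows about. If a job's targets disappear from service discovery entirely, there is no up series left to compare. A companion rule using absent() can cover that case; in this sketch the alert name, the 5m duration and job="api" are assumptions.

```yaml
- alert: P1ProdApiTargetsAbsent              # hypothetical alert name
  # absent() returns a single series only when nothing matches the selector.
  expr: absent(up{env="prod", job="api"})    # job="api" is an assumed job name
  for: 5m
  labels:
    severity: P1
    env: prod
    team: platform
    resource_type: job
  annotations:
    summary: "No scrape targets found for job api in prod"
    runbook_url: "https://docs.example.com/runbooks/target-down"
```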

Blackbox probe failed:

- alert: P0ProdApiBlackboxFailed
  expr: probe_success{env="prod", service="api"} == 0
  for: 3m
  labels:
    severity: P0
    env: prod
    service: api
    team: platform
    resource_type: endpoint
  annotations:
    summary: "API blackbox probe failed"
    dashboard_url: "https://grafana.example.com/d/blackbox"
    runbook_url: "https://docs.example.com/runbooks/api-down"
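
The env and service labels on probe_success have to be attached at scrape time. One way to do that is the usual blackbox exporter relabeling pattern, sketched below. The probed URL, the exporter address blackbox-exporter:9115, the job name and the http_2xx module are assumptions about the environment.

```yaml
scrape_configs:
  - job_name: blackbox-api                     # assumed job name
    metrics_path: /probe
    params:
      module: [http_2xx]                       # assumed module from the exporter config
    static_configs:
      - targets:
          - https://api.example.com/healthz    # hypothetical endpoint to probe
        labels:
          env: prod
          service: api
    relabel_configs:
      # Pass the listed target to the exporter as the ?target= parameter.
      - source_labels: [__address__]
        target_label: __param_target
      # Keep the probed URL as the instance label.
      - source_labels: [__param_target]
        target_label: instance
      # Scrape the exporter itself instead of the probed URL.
      - target_label: __address__
        replacement: blackbox-exporter:9115    # assumed exporter address
```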

7. Production Checklist

rules:
    promtool check rules runs in CI
    "for" duration set for every paging alert
    labels follow the standard in section 2
    annotations include runbook_url and dashboard_url

alertmanager:
    HA pair configured
    route by severity/env/team
    send_resolved enabled for P0/P1
    secrets mounted from file or secret store
    silences audited

quality:
    every page is actionable
    low-cardinality labels only
    inhibition prevents alert storms
    noisy alerts reviewed regularly