Prometheus Rules

SRE Prometheus Rules

Important Points

Rules files should cover:
    1. recording rules: precompute frequently used, complex PromQL expressions
    2. alerting rules: alert only on problems that require action
    3. SLO rules: compute availability, error rate, and burn rate

Recommended split:
    service-recording-rules.yml
    service-alerting-rules.yml
    service-slo-rules.yml
    infrastructure-alerting-rules.yml
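
A minimal sketch of how Prometheus could load the four files above. It assumes they are deployed in a rules/ directory alongside prometheus.yml; that directory and the 30s evaluation interval are illustrative choices, not something this page prescribes.

```yaml
# prometheus.yml (fragment)
global:
  # Default for rule groups that do not set their own interval (assumed value).
  evaluation_interval: 30s

rule_files:
  # The rules/ prefix is an assumption; adjust to where the files are deployed.
  - rules/service-recording-rules.yml
  - rules/service-alerting-rules.yml
  - rules/service-slo-rules.yml
  - rules/infrastructure-alerting-rules.yml
```

Each file can be checked on its own with promtool check rules <file> before Prometheus is reloaded.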

1. Recording Rules

groups:
  - name: order-api-recording-rules
    interval: 30s
    rules:
      - record: service:http_requests:rate5m
        expr: |
          sum by (service, route, method, status) (
            rate(http_requests_total[5m])
          )

      - record: service:http_errors:rate5m
        expr: |
          sum by (service, route, method, status) (
            rate(http_requests_total{status=~"5.."}[5m])
          )

      - record: service:http_error_ratio:rate5m
        expr: |
          sum by (service) (
            rate(http_requests_total{status=~"5.."}[5m])
          )
          /
          sum by (service) (
            rate(http_requests_total[5m])
          )

      - record: service:http_request_duration:p95_5m
        expr: |
          histogram_quantile(
            0.95,
            sum by (le, service, route) (
              rate(http_request_duration_seconds_bucket[5m])
            )
          )
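
Rules in a group are evaluated sequentially, so a later rule can build on a series recorded earlier in the same group. As a sketch, the error ratio could be derived from the two recorded rates instead of re-reading the raw counter. This is an alternative formulation under that assumption, not a change the rules above require.

```yaml
groups:
  - name: order-api-recording-rules
    interval: 30s
    rules:
      - record: service:http_requests:rate5m
        expr: |
          sum by (service, route, method, status) (
            rate(http_requests_total[5m])
          )

      - record: service:http_errors:rate5m
        expr: |
          sum by (service, route, method, status) (
            rate(http_requests_total{status=~"5.."}[5m])
          )

      # Must come after the two rules above so it reads their fresh results.
      - record: service:http_error_ratio:rate5m
        expr: |
          sum by (service) (service:http_errors:rate5m)
          /
          sum by (service) (service:http_requests:rate5m)
```

As with the raw-counter version, the ratio has no sample while a service has no 5xx series at all.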

2. Service Alerts

groups:
  - name: order-api-alerting-rules
    rules:
      - alert: HighHttpErrorRate
        expr: service:http_error_ratio:rate5m{service="order-api"} > 0.05
        for: 10m
        labels:
          severity: critical
          service: order-api
          team: platform
        annotations:
          summary: "High HTTP error rate"
          description: "order-api 5xx error ratio is higher than 5% for 10 minutes."
          dashboard_url: "https://grafana.example.com/d/order-api"
          runbook_url: "https://wiki.example.com/runbooks/order-api-5xx"

      - alert: HighP95Latency
        expr: service:http_request_duration:p95_5m{service="order-api"} > 1
        for: 10m
        labels:
          severity: warning
          service: order-api
          team: platform
        annotations:
          summary: "High p95 latency"
          description: "order-api p95 latency is higher than 1s for 10 minutes."
          dashboard_url: "https://grafana.example.com/d/order-api"
          runbook_url: "https://wiki.example.com/runbooks/order-api-latency"
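
An error ratio can cross 5% on a handful of requests when traffic is low. One way to guard against that is to require a minimum request rate in the same expression. The sketch below assumes a floor of 1 request per second, an arbitrary illustrative value that would need tuning per service.

```yaml
groups:
  - name: order-api-alerting-rules
    rules:
      - alert: HighHttpErrorRate
        expr: |
          service:http_error_ratio:rate5m{service="order-api"} > 0.05
          and
          # Assumed traffic floor: ignore the ratio below 1 request/second.
          sum by (service) (service:http_requests:rate5m{service="order-api"}) > 1
        for: 10m
        labels:
          severity: critical
          service: order-api
          team: platform
        annotations:
          summary: "High HTTP error rate"
          description: "order-api 5xx error ratio is higher than 5% for 10 minutes."
```

Both sides of the and carry only the service label, so they match one-to-one and the alert value remains the error ratio.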

3. Queue / Worker Alerts

groups:
  - name: order-worker-alerting-rules
    rules:
      - alert: HighQueueLag
        expr: |
          max by (service, queue, consumer_group) (
            queue_consumer_lag{service="order-worker"}
          ) > 10000
        for: 15m
        labels:
          severity: warning
          service: order-worker
          team: platform
        annotations:
          summary: "High queue consumer lag"
          description: "Consumer lag on queue {{ $labels.queue }} (consumer group {{ $labels.consumer_group }}) has been above 10000 for 15 minutes."

      - alert: WorkerJobFailureRateHigh
        expr: |
          sum(rate(job_failures_total{service="order-worker"}[5m]))
          /
          sum(rate(job_runs_total{service="order-worker"}[5m]))
          > 0.05
        for: 10m
        labels:
          severity: warning
          service: order-worker
          team: platform
        annotations:
          summary: "High worker job failure rate"
          description: "order-worker job failure rate is higher than 5% for 10 minutes."
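
A threshold alert stays silent if the metric disappears, because an empty result never exceeds 10000. A sketch of a companion rule built on absent(); the alert name, the 10m duration, and the severity are assumptions chosen for illustration.

```yaml
groups:
  - name: order-worker-alerting-rules
    rules:
      # Hypothetical companion alert: fires when no lag series exist at all.
      - alert: QueueLagMetricMissing
        expr: absent(queue_consumer_lag{service="order-worker"})
        for: 10m
        labels:
          severity: warning
          service: order-worker
          team: platform
        annotations:
          summary: "Queue lag metric missing"
          description: "No queue_consumer_lag series for order-worker have been reported for 10 minutes."
```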

4. Infrastructure Alerts

groups:
  - name: infrastructure-alerting-rules
    rules:
      - alert: NodeHighCpuUsage
        expr: |
          (
            1 -
            avg by (instance) (
              rate(node_cpu_seconds_total{mode="idle"}[5m])
            )
          ) * 100 > 85
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "High CPU usage on {{ $labels.instance }}"
          description: "CPU usage on {{ $labels.instance }} has been above 85% for 15 minutes."

      - alert: NodeHighMemoryUsage
        expr: |
          (
            1 -
            node_memory_MemAvailable_bytes
            /
            node_memory_MemTotal_bytes
          ) * 100 > 90
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "High memory usage on {{ $labels.instance }}"
          description: "Memory usage on {{ $labels.instance }} has been above 90% for 15 minutes."

      - alert: DiskWillFillSoon
        expr: |
          predict_linear(
            node_filesystem_avail_bytes{fstype!~"tmpfs|overlay"}[7d],
            7 * 24 * 3600
          ) < 0
        for: 30m
        labels:
          severity: warning
        annotations:
          summary: "Disk will fill soon on {{ $labels.instance }}"
          description: "Filesystem {{ $labels.mountpoint }} on {{ $labels.instance }} is predicted to run out of space within 7 days."

5. SLO Alerts

groups:
  - name: order-api-slo-rules
    rules:
      - record: service:slo_error_ratio:rate5m
        expr: |
          sum by (service) (
            rate(http_requests_total{status=~"5.."}[5m])
          )
          /
          sum by (service) (
            rate(http_requests_total[5m])
          )

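      # 0.001 is the error budget, which corresponds to a 99.9% availability SLO.
      # Dividing the error ratio by the budget gives the burn rate.
      - alert: FastErrorBudgetBurn
        expr: service:slo_error_ratio:rate5m{service="order-api"} / 0.001 > 14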
        for: 5m
        labels:
          severity: critical
          service: order-api
          team: platform
        annotations:
          summary: "Fast error budget burn"
          description: "order-api is burning error budget faster than 14x."

      - alert: SlowErrorBudgetBurn
        expr: service:slo_error_ratio:rate5m{service="order-api"} / 0.001 > 3
        for: 1h
        labels:
          severity: warning
          service: order-api
          team: platform
        annotations:
          summary: "Slow error budget burn"
          description: "order-api is burning error budget faster than 3x for 1 hour."
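
A 5m error ratio combined with a long for clause resets whenever a single evaluation dips below the threshold. Burn-rate alerts are therefore often written against two windows: a long one showing that a meaningful share of the budget is gone, and a short one confirming the burn is still happening. The sketch below keeps the 0.001 budget and the 14x factor from the rules above; the 1h window and the service:slo_error_ratio:rate1h name are assumptions added for illustration.

```yaml
groups:
  - name: order-api-slo-rules
    rules:
      - record: service:slo_error_ratio:rate5m
        expr: |
          sum by (service) (rate(http_requests_total{status=~"5.."}[5m]))
          /
          sum by (service) (rate(http_requests_total[5m]))

      # Hypothetical long-window companion to the 5m ratio.
      - record: service:slo_error_ratio:rate1h
        expr: |
          sum by (service) (rate(http_requests_total{status=~"5.."}[1h]))
          /
          sum by (service) (rate(http_requests_total[1h]))

      - alert: FastErrorBudgetBurn
        # Fires only when both the long and the short window burn faster than 14x.
        expr: |
          service:slo_error_ratio:rate1h{service="order-api"} / 0.001 > 14
          and
          service:slo_error_ratio:rate5m{service="order-api"} / 0.001 > 14
        for: 5m
        labels:
          severity: critical
          service: order-api
          team: platform
        annotations:
          summary: "Fast error budget burn"
          description: "order-api is burning error budget faster than 14x."
```

The same pattern could be applied to the slow-burn alert with a longer pair of windows and the 3x factor.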

6. Rule Checklist

Before merging rules:
    [ ] alert has severity
    [ ] alert has service / team
    [ ] alert has summary / description
    [ ] alert has dashboard_url / runbook_url
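    [ ] counters are wrapped in rate() / increase(), never compared as raw values
    [ ] histogram_quantile() is applied to rate() of the _bucket series, aggregated with le preserved
    [ ] ratio alert handles a missing numerator and a zero or missing denominator
    [ ] for duration is long enough to ignore short spikes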
    [ ] rule is tested in Prometheus / vmalert
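
The last item can be covered with a promtool unit test. The sketch below assumes the order-worker alert group from section 3 is saved in service-alerting-rules.yml and that the test lives next to it in a file named service-alerting-rules.test.yml; both the file placement and the synthetic series values are assumptions for illustration.

```yaml
# service-alerting-rules.test.yml (hypothetical file name)
rule_files:
  - service-alerting-rules.yml

evaluation_interval: 1m

tests:
  - interval: 1m
    input_series:
      # 100 job runs per minute, 10 of them failing: a 10% failure rate.
      - series: 'job_runs_total{service="order-worker"}'
        values: '0+100x30'
      - series: 'job_failures_total{service="order-worker"}'
        values: '0+10x30'
    alert_rule_test:
      # Still inside the 10m "for" window: the alert is pending, not firing.
      - eval_time: 5m
        alertname: WorkerJobFailureRateHigh
        exp_alerts: []
      # Well past the "for" window: the alert is expected to fire.
      - eval_time: 20m
        alertname: WorkerJobFailureRateHigh
        exp_alerts:
          - exp_labels:
              severity: warning
              service: order-worker
              team: platform
            exp_annotations:
              summary: "High worker job failure rate"
              description: "order-worker job failure rate is higher than 5% for 10 minutes."
```

It would be run with promtool test rules service-alerting-rules.test.yml. The expected labels and annotations must match the rule exactly, so the test needs updating whenever the alert text changes.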