SRE Prometheus Rules
Important Points
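The rules files should cover:
1. recording rules: precompute frequently used, complex PromQL expressions
2. alerting rules: alert only on problems that require action
3. SLO rules: compute availability, error rate, and burn rate
Recommended file split: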
service-recording-rules.yml
service-alerting-rules.yml
service-slo-rules.yml
infrastructure-alerting-rules.yml
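For the split files to take effect, Prometheus has to load them. A minimal sketch of the `rule_files` section in `prometheus.yml` follows; the `rules/` directory and the 30s global evaluation interval are assumptions for illustration, not something the list above prescribes.

```yaml
# prometheus.yml (fragment)
global:
  evaluation_interval: 30s   # assumed default for groups that set no interval

rule_files:
  # The rules/ directory is hypothetical; paths are resolved relative to
  # the location of prometheus.yml.
  - "rules/service-recording-rules.yml"
  - "rules/service-alerting-rules.yml"
  - "rules/service-slo-rules.yml"
  - "rules/infrastructure-alerting-rules.yml"
```

Each file can be syntax-checked before a reload with `promtool check rules <file>`.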
1. Recording Rules
groups:
  - name: order-api-recording-rules
    interval: 30s
    rules:
      - record: service:http_requests:rate5m
        expr: |
          sum by (service, route, method, status) (
            rate(http_requests_total[5m])
          )
      - record: service:http_errors:rate5m
        expr: |
          sum by (service, route, method, status) (
            rate(http_requests_total{status=~"5.."}[5m])
          )
      - record: service:http_error_ratio:rate5m
        expr: |
          sum by (service) (
            rate(http_requests_total{status=~"5.."}[5m])
          )
          /
          sum by (service) (
            rate(http_requests_total[5m])
          )
      - record: service:http_request_duration:p95_5m
        expr: |
          histogram_quantile(
            0.95,
            sum by (le, service, route) (
              rate(http_request_duration_seconds_bucket[5m])
            )
          )
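One benefit of recording rules is that later rules can build on the precomputed series instead of re-scanning the raw counters. The sketch below derives the same error ratio from `service:http_requests:rate5m`. The group name and the `service:http_error_ratio_derived:rate5m` record name are hypothetical, chosen only so the example does not collide with the rules above.

```yaml
groups:
  - name: order-api-derived-recording-rules   # hypothetical group name
    interval: 30s
    rules:
      # Hypothetical record name. Reuses the recorded per-status request
      # rate rather than calling rate() on http_requests_total again.
      - record: service:http_error_ratio_derived:rate5m
        expr: |
          sum by (service) (
            service:http_requests:rate5m{status=~"5.."}
          )
          /
          sum by (service) (
            service:http_requests:rate5m
          )
```

Because this rule lives in a separate group, its input may be up to one evaluation interval old. If that matters, it could be placed in the same group after the rule that produces `service:http_requests:rate5m`.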
2. Service Alerts
groups:
  - name: order-api-alerting-rules
    rules:
      - alert: HighHttpErrorRate
        expr: service:http_error_ratio:rate5m{service="order-api"} > 0.05
        for: 10m
        labels:
          severity: critical
          service: order-api
          team: platform
        annotations:
          summary: "High HTTP error rate"
          description: "order-api 5xx error ratio is higher than 5% for 10 minutes."
          dashboard_url: "https://grafana.example.com/d/order-api"
          runbook_url: "https://wiki.example.com/runbooks/order-api-5xx"
      - alert: HighP95Latency
        expr: service:http_request_duration:p95_5m{service="order-api"} > 1
        for: 10m
        labels:
          severity: warning
          service: order-api
          team: platform
        annotations:
          summary: "High p95 latency"
          description: "order-api p95 latency is higher than 1s for 10 minutes."
          dashboard_url: "https://grafana.example.com/d/order-api"
          runbook_url: "https://wiki.example.com/runbooks/order-api-latency"
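The annotations above are static strings. Prometheus alert templates can also include the measured value and the series labels, which usually saves a dashboard lookup during triage. The variant below is a sketch of `HighHttpErrorRate` using `$labels`, `$value`, and the built-in `humanizePercentage` function. The group name is hypothetical; everything else is copied from the rule above.

```yaml
groups:
  - name: order-api-alerting-rules-templated   # hypothetical group name
    rules:
      - alert: HighHttpErrorRate
        expr: service:http_error_ratio:rate5m{service="order-api"} > 0.05
        for: 10m
        labels:
          severity: critical
          service: order-api
          team: platform
        annotations:
          summary: "High HTTP error rate on {{ $labels.service }}"
          # $value is the ratio returned by expr (for example 0.07);
          # humanizePercentage renders it as a percentage string.
          description: "{{ $labels.service }} 5xx error ratio is {{ $value | humanizePercentage }} (threshold 5%) for 10 minutes."
          dashboard_url: "https://grafana.example.com/d/order-api"
          runbook_url: "https://wiki.example.com/runbooks/order-api-5xx"
```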
3. Queue / Worker Alerts
groups:
  - name: order-worker-alerting-rules
    rules:
      - alert: HighQueueLag
        expr: |
          max by (service, queue, consumer_group) (
            queue_consumer_lag{service="order-worker"}
          ) > 10000
        for: 15m
        labels:
          severity: warning
          service: order-worker
          team: platform
        annotations:
          summary: "High queue consumer lag"
          description: "order-worker queue consumer lag is higher than 10000 for 15 minutes."
      - alert: WorkerJobFailureRateHigh
        expr: |
          sum(rate(job_failures_total{service="order-worker"}[5m]))
          /
          sum(rate(job_runs_total{service="order-worker"}[5m]))
          > 0.05
        for: 10m
        labels:
          severity: warning
          service: order-worker
          team: platform
        annotations:
          summary: "High worker job failure rate"
          description: "order-worker job failure rate is higher than 5% for 10 minutes."
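A ratio alert like `WorkerJobFailureRateHigh` can behave badly at very low volume: a single failure out of a handful of runs already exceeds 5%. One common guard is to require a minimum run rate with an `and` clause. The sketch below assumes a floor of 0.1 runs per second, which is an illustrative number and not a recommendation. The alert name is hypothetical.

```yaml
groups:
  - name: order-worker-alerting-rules-guarded   # hypothetical group name
    rules:
      - alert: WorkerJobFailureRateHighWithTraffic   # hypothetical alert name
        expr: |
          (
            sum(rate(job_failures_total{service="order-worker"}[5m]))
            /
            sum(rate(job_runs_total{service="order-worker"}[5m]))
          ) > 0.05
          and
          # Assumed traffic floor: only alert when at least 0.1 jobs/s run.
          sum(rate(job_runs_total{service="order-worker"}[5m])) > 0.1
        for: 10m
        labels:
          severity: warning
          service: order-worker
          team: platform
        annotations:
          summary: "High worker job failure rate"
          description: "order-worker job failure rate is higher than 5% for 10 minutes."
```

Both sides of `and` are aggregated without a `by` clause, so they carry the same (empty) label set and match each other.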
4. Infrastructure Alerts
groups:
  - name: infrastructure-alerting-rules
    rules:
      - alert: NodeHighCpuUsage
        expr: |
          100 -
          avg by (instance) (
            rate(node_cpu_seconds_total{mode="idle"}[5m])
          ) * 100 > 85
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "High CPU usage"
          description: "CPU usage is higher than 85% for 15 minutes."
      - alert: NodeHighMemoryUsage
        expr: |
          (
            1 -
            node_memory_MemAvailable_bytes
            /
            node_memory_MemTotal_bytes
          ) * 100 > 90
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "High memory usage"
          description: "Memory usage is higher than 90% for 15 minutes."
      - alert: DiskWillFillSoon
        expr: |
          predict_linear(
            node_filesystem_avail_bytes{fstype!~"tmpfs|overlay"}[7d],
            7 * 24 * 3600
          ) < 0
        for: 30m
        labels:
          severity: warning
        annotations:
          summary: "Disk will fill soon"
          description: "Disk is predicted to fill within 7 days."
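A pure `predict_linear` alert can fire on a mostly empty disk that happens to be trending downward. A common refinement is to combine the prediction with a free-space threshold. The sketch below assumes a 20% free-space threshold, a 6h regression window, and a 24h horizon; these numbers and the alert name are illustrative and differ from the 7-day rule above.

```yaml
groups:
  - name: infrastructure-alerting-rules-disk   # hypothetical group name
    rules:
      - alert: DiskWillFillSoonAndLowSpace   # hypothetical alert name
        expr: |
          (
            node_filesystem_avail_bytes{fstype!~"tmpfs|overlay"}
            /
            node_filesystem_size_bytes{fstype!~"tmpfs|overlay"}
          ) < 0.20
          and
          predict_linear(
            node_filesystem_avail_bytes{fstype!~"tmpfs|overlay"}[6h],
            24 * 3600
          ) < 0
        for: 30m
        labels:
          severity: warning
        annotations:
          summary: "Disk will fill soon"
          description: "{{ $labels.instance }} {{ $labels.mountpoint }} has less than 20% free and is predicted to fill within 24 hours."
```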
5. SLO Alerts
groups:
  - name: order-api-slo-rules
    rules:
      - record: service:slo_error_ratio:rate5m
        expr: |
          sum by (service) (
            rate(http_requests_total{status=~"5.."}[5m])
          )
          /
          sum by (service) (
            rate(http_requests_total[5m])
          )
      - alert: FastErrorBudgetBurn
        expr: service:slo_error_ratio:rate5m{service="order-api"} / 0.001 > 14
        for: 5m
        labels:
          severity: critical
          service: order-api
          team: platform
        annotations:
          summary: "Fast error budget burn"
          description: "order-api is burning error budget faster than 14x."
      - alert: SlowErrorBudgetBurn
        expr: service:slo_error_ratio:rate5m{service="order-api"} / 0.001 > 3
        for: 1h
        labels:
          severity: warning
          service: order-api
          team: platform
        annotations:
          summary: "Slow error budget burn"
          description: "order-api is burning error budget faster than 3x for 1 hour."
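In these rules, 0.001 is the error budget of a 99.9% SLO, so dividing the error ratio by 0.001 gives the burn rate. Both alerts rely on a single 5-minute window plus a `for` clause. A widely used alternative is the multi-window approach, where a long window shows that the burn is significant and a short window shows that it is still happening. The sketch below assumes an extra recording rule, `service:slo_error_ratio:rate1h`, and a hypothetical alert name; the 14x threshold and the 0.001 budget mirror the rules above.

```yaml
groups:
  - name: order-api-slo-rules-multiwindow   # hypothetical group name
    rules:
      # Hypothetical 1h counterpart of service:slo_error_ratio:rate5m.
      - record: service:slo_error_ratio:rate1h
        expr: |
          sum by (service) (
            rate(http_requests_total{status=~"5.."}[1h])
          )
          /
          sum by (service) (
            rate(http_requests_total[1h])
          )
      - alert: FastErrorBudgetBurnMultiWindow   # hypothetical alert name
        expr: |
          service:slo_error_ratio:rate1h{service="order-api"} / 0.001 > 14
          and
          service:slo_error_ratio:rate5m{service="order-api"} / 0.001 > 14
        labels:
          severity: critical
          service: order-api
          team: platform
        annotations:
          summary: "Fast error budget burn"
          description: "order-api is burning error budget faster than 14x over both the 1h and 5m windows."
```

With this shape the short window lets the alert resolve soon after the error rate recovers, so a `for` clause is often omitted.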
6. Rule Checklist
Before merging rules:
[ ] alert has severity
[ ] alert has service / team
[ ] alert has summary / description
[ ] alert has dashboard_url / runbook_url
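[ ] counter-based expressions use rate() / increase(), not the raw counter value
[ ] histogram_quantile() is applied to rate() of the _bucket series, aggregated with the le label kept
[ ] ratio alert handles a missing numerator and a zero or missing denominator
[ ] for: duration is long enough to ignore short spikes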
[ ] rule is tested in Prometheus / vmalert
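The last checklist item can be automated with `promtool test rules`, which evaluates rule files against synthetic series. The sketch below tests `HighHttpErrorRate` from section 2. It assumes the recording and alerting rules are saved, with valid YAML indentation, as `service-recording-rules.yml` and `service-alerting-rules.yml` next to the test file. The test file name and the `route` and `method` label values are hypothetical.

```yaml
# order-api-alerts-test.yml (hypothetical file name)
# Run with: promtool test rules order-api-alerts-test.yml
rule_files:
  - service-recording-rules.yml
  - service-alerting-rules.yml

evaluation_interval: 1m

tests:
  - interval: 1m
    input_series:
      # 10 failed and 90 successful requests per minute: a steady 10% error ratio.
      - series: 'http_requests_total{service="order-api", route="/orders", method="GET", status="500"}'
        values: '0+10x30'
      - series: 'http_requests_total{service="order-api", route="/orders", method="GET", status="200"}'
        values: '0+90x30'
    alert_rule_test:
      # By 20m the ratio has been above 5% for longer than the 10m "for" clause.
      - eval_time: 20m
        alertname: HighHttpErrorRate
        exp_alerts:
          - exp_labels:
              severity: critical
              service: order-api
              team: platform
            exp_annotations:
              summary: "High HTTP error rate"
              description: "order-api 5xx error ratio is higher than 5% for 10 minutes."
              dashboard_url: "https://grafana.example.com/d/order-api"
              runbook_url: "https://wiki.example.com/runbooks/order-api-5xx"
```

The expected labels and annotations must match the firing alert exactly, so the test also catches accidental edits to alert metadata.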