Hands-on SRE With Prometheus#
Important Points#
This example covers the core of day-to-day SRE work:
1. How to write application metrics
2. How to collect node / container / database / queue metrics
3. How to compute availability / latency / traffic / errors with PromQL
4. How to assess saturation / capacity
5. How to implement SLO / error budget / burn rate
6. How to write alert rules
7. What a dashboard should show
Example system:
order-api
  receives order requests from users
  calls payment-api
  writes to PostgreSQL
  publishes messages to Kafka / queue
order-worker
  consumes the queue
  processes asynchronous jobs
dependencies:
  PostgreSQL
  Redis
  Kafka / RabbitMQ
  payment-api
1. Metrics Sources#
| Target | How To Collect | Important Metrics |
|---|---|---|
| order-api | app exposes /metrics | request count, latency, errors, in-flight |
| order-worker | app exposes /metrics | job count, job duration, failures, queue lag |
| VM / Node | node exporter | CPU, memory, disk, network |
| Container | cAdvisor / kubelet | container CPU, memory, restart |
| PostgreSQL | postgres exporter | connections, query latency, locks |
| Redis | redis exporter | memory, hit rate, evictions |
| Kafka | kafka exporter | consumer lag, broker health |
| External API | app metrics / blackbox exporter | dependency latency, error rate |
2. Prometheus Scrape Config#
scrape_configs:
  - job_name: order-api
    metrics_path: /metrics
    static_configs:
      - targets:
          - order-api:3000
        labels:
          service: order-api
          env: prod
          team: platform
  - job_name: order-worker
    metrics_path: /metrics
    static_configs:
      - targets:
          - order-worker:3000
        labels:
          service: order-worker
          env: prod
          team: platform
  - job_name: node-exporter
    static_configs:
      - targets:
          - node-1:9100
          - node-2:9100
  - job_name: postgres-exporter
    static_configs:
      - targets:
          - postgres-exporter:9187
  - job_name: blackbox
    metrics_path: /probe
    params:
      module:
        - http_2xx
    static_configs:
      - targets:
          - https://api.example.com/healthz
    relabel_configs:
      # Pass the probed URL to the exporter as the ?target= query parameter.
      - source_labels: [__address__]
        target_label: __param_target
      # Keep the probed URL as the instance label.
      - source_labels: [__param_target]
        target_label: instance
      # Scrape the blackbox exporter itself instead of the probed URL.
      - target_label: __address__
        replacement: blackbox-exporter:9115
3. Application Metrics#
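For orientation, each metric in this section reaches Prometheus as plain text on `/metrics`. The following is a minimal sketch of that text exposition format for one labeled counter, using only the Python standard library. `render_counter`, `escape_label_value`, and the sample values are hypothetical, and a real service would normally rely on an official Prometheus client library rather than rendering lines by hand.

```python
def escape_label_value(value: str) -> str:
    """Escape backslash, double quote, and newline as the text format requires."""
    return value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def render_counter(name: str, help_text: str, samples: dict) -> str:
    """Render one counter family in the Prometheus text exposition format.

    `samples` maps a tuple of (label, value) pairs to the counter value.
    Hypothetical helper for illustration only.
    """
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} counter"]
    for labels, value in samples.items():
        label_text = ",".join(
            f'{key}="{escape_label_value(val)}"' for key, val in labels
        )
        lines.append(f"{name}{{{label_text}}} {value}")
    return "\n".join(lines) + "\n"


# Hypothetical request counts for order-api.
samples = {
    (("service", "order-api"), ("method", "POST"), ("route", "/orders"), ("status", "201")): 1027,
    (("service", "order-api"), ("method", "POST"), ("route", "/orders"), ("status", "500")): 3,
}
print(render_counter("http_requests_total", "Total HTTP requests.", samples))
```

Each distinct label combination becomes its own time series, which is why the `route` label should hold a route template rather than a raw URL path.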
order-api should expose:
HTTP:
  http_requests_total{service, method, route, status}
  http_request_duration_seconds_bucket{service, method, route, status, le}
  http_in_flight_requests{service}
Business:
  orders_created_total{service, channel}
  payments_total{service, provider, result}
  payment_failures_total{service, provider, reason}
Dependency:
  external_requests_total{service, dependency, method, status}
  external_request_duration_seconds_bucket{service, dependency, le}
  database_queries_total{service, database, operation, status}
  database_query_duration_seconds_bucket{service, database, operation, le}
Cache (used by the cache hit rate query in section 9):
  cache_hits_total{service}
  cache_misses_total{service}
Queue:
  queue_messages_published_total{service, queue}
  queue_messages_consumed_total{service, queue}
  queue_message_processing_duration_seconds_bucket{service, queue, le}
  queue_depth{service, queue}
  queue_consumer_lag{service, queue, consumer_group}
4. Availability#
What It Means#
availability = successful requests / total requests
Typically:
  2xx / 3xx = success
  5xx = server failure
  Whether 4xx counts as a failure depends on the business
PromQL#
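As a worked sketch of the ratio this section's query computes, the following applies the same success / total arithmetic to hypothetical per-status request counts. The `count_4xx_as_failure` switch is an assumption added here to illustrate the 4xx policy choice; it is not part of the queries in this document.

```python
def availability(counts_by_status: dict, count_4xx_as_failure: bool = False) -> float:
    """Return successful requests / total requests.

    5xx always counts as failure; 4xx counts as failure only if requested.
    """
    total = sum(counts_by_status.values())
    if total == 0:
        raise ValueError("no requests in the window")
    failed = 0
    for status, count in counts_by_status.items():
        if status.startswith("5") or (count_4xx_as_failure and status.startswith("4")):
            failed += count
    return (total - failed) / total


# Hypothetical request counts over one window.
window = {"200": 9800, "302": 100, "404": 60, "500": 30, "503": 10}
print(f"{availability(window):.4f}")        # 4xx treated as success
print(f"{availability(window, True):.4f}")  # 4xx treated as failure
```

With these numbers the choice moves the result from 99.6% to 99.0%, which is enough to decide whether a 99% alert threshold fires.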
sum(rate(http_requests_total{service="order-api", status!~"5.."}[5m]))
/
sum(rate(http_requests_total{service="order-api"}[5m]))
Alert#
groups:
  - name: sre-availability
    rules:
      - alert: OrderApiLowAvailability
        expr: |
          (
            sum(rate(http_requests_total{service="order-api", status!~"5.."}[5m]))
            /
            sum(rate(http_requests_total{service="order-api"}[5m]))
          ) < 0.99
        for: 10m
        labels:
          severity: critical
          service: order-api
        annotations:
          summary: "order-api availability is below 99%"
          description: "Successful request ratio is below 99% for 10 minutes."
5. Latency#
What It Means#
For latency, focus on percentiles:
  p50: the typical user experience
  p95: the experience of most users
  p99: long-tail problems
PromQL#
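`histogram_quantile` estimates a percentile by linear interpolation inside the bucket that contains the target rank, so its accuracy depends on the bucket boundaries. The following is a simplified sketch of that calculation on hypothetical cumulative bucket counts. It assumes sorted buckets, a final `+Inf` bucket, and `0 < q < 1`, and it skips the edge cases the real function handles.

```python
import math


def histogram_quantile(q: float, buckets: list) -> float:
    """Estimate the q-quantile from cumulative (upper_bound, count) buckets.

    Simplified sketch: assumes 0 < q < 1, buckets sorted by upper bound,
    cumulative counts, a final +Inf bucket, and at least one observation.
    """
    total = buckets[-1][1]
    rank = q * total
    # Find the first bucket whose cumulative count reaches the target rank.
    index = next(i for i, (_, count) in enumerate(buckets) if count >= rank)
    upper = buckets[index][0]
    if math.isinf(upper):
        # The quantile falls in the +Inf bucket: report the highest finite bound.
        return buckets[-2][0]
    lower = buckets[index - 1][0] if index > 0 else 0.0
    below = buckets[index - 1][1] if index > 0 else 0
    in_bucket = buckets[index][1] - below
    # Linear interpolation inside the bucket.
    return lower + (upper - lower) * (rank - below) / in_bucket


# Hypothetical cumulative counts for le = 0.1, 0.25, 0.5, 1.0, +Inf (seconds).
buckets = [(0.1, 60), (0.25, 85), (0.5, 94), (1.0, 99), (math.inf, 100)]
print(f"p50 = {histogram_quantile(0.50, buckets):.3f}s")
print(f"p95 = {histogram_quantile(0.95, buckets):.3f}s")
```

Here p95 lands inside the 0.5 to 1.0 bucket, so the estimate can only be as precise as that bucket is narrow. If a latency alert threshold is 1s, it helps to have a bucket boundary at or near 1s.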
histogram_quantile(
  0.95,
  sum by (le, service, route) (
    rate(http_request_duration_seconds_bucket{service="order-api"}[5m])
  )
)
Alert#
groups:
  - name: sre-latency
    rules:
      - alert: OrderApiHighP95Latency
        expr: |
          histogram_quantile(
            0.95,
            sum by (le, service, route) (
              rate(http_request_duration_seconds_bucket{service="order-api"}[5m])
            )
          ) > 1
        for: 10m
        labels:
          severity: warning
          service: order-api
        annotations:
          summary: "order-api p95 latency is high"
          description: "p95 latency is higher than 1s for 10 minutes."
6. Traffic#
What It Means#
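Traffic is used to determine:
  how much traffic there is right now
  whether it has dropped suddenly
  whether it has spiked suddenly
  whether it exceeds the historical peak
PromQL#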
Request rate:
sum by (service, route, method) (
  rate(http_requests_total{service="order-api"}[5m])
)
Traffic drop:
sum(rate(http_requests_total{service="order-api"}[5m]))
<
sum(rate(http_requests_total{service="order-api"}[1h] offset 1d)) * 0.5
7. Errors#
What It Means#
For errors, look at:
  server error rate
  dependency error rate
  business failure rate
PromQL#
HTTP 5xx error rate:
sum(rate(http_requests_total{service="order-api", status=~"5.."}[5m]))
/
sum(rate(http_requests_total{service="order-api"}[5m]))
Payment failure rate:
sum(rate(payment_failures_total{service="order-api"}[5m]))
/
sum(rate(payments_total{service="order-api"}[5m]))
Dependency error rate:
sum by (dependency) (
  rate(external_requests_total{service="order-api", status=~"5.."}[5m])
)
/
sum by (dependency) (
  rate(external_requests_total{service="order-api"}[5m])
)
8. Saturation#
What It Means#
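Saturation means the system is approaching its limits:
  CPU is nearly maxed out
  memory is nearly full
  disk is almost full
  connection pool is almost exhausted
  queue is backing up
  consumer lag is growing
Node CPU#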
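The CPU query in this subsection turns the per-core idle counter into a utilization percentage. A minimal sketch of the same arithmetic on hypothetical counter readings (two cores, 300-second window) follows. `cpu_utilization_percent` is an illustrative helper, and it ignores the extrapolation and counter-reset handling that `rate()` performs.

```python
def cpu_utilization_percent(idle_start: list, idle_end: list, window_seconds: float) -> float:
    """Return 100 - average idle fraction * 100 across cores.

    idle_start / idle_end are node_cpu_seconds_total{mode="idle"} readings
    per core at the start and end of the window.
    """
    idle_rates = [
        (end - start) / window_seconds  # idle seconds per second, between 0 and 1
        for start, end in zip(idle_start, idle_end)
    ]
    avg_idle = sum(idle_rates) / len(idle_rates)
    return 100 - avg_idle * 100


# Hypothetical readings: core 0 was idle 60 s of 300 s, core 1 was idle 120 s.
usage = cpu_utilization_percent([1000.0, 2000.0], [1060.0, 2120.0], 300.0)
print(f"{usage:.1f}% busy")
```

Averaging across cores hides a single pinned core, so a per-core view is still useful when a single-threaded process is the suspect.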
100 -
avg by (instance) (
  rate(node_cpu_seconds_total{mode="idle"}[5m])
) * 100
Memory Usage#
(
  1 -
  node_memory_MemAvailable_bytes
  /
  node_memory_MemTotal_bytes
) * 100
Disk Usage#
(
  1 -
  node_filesystem_avail_bytes{fstype!~"tmpfs|overlay"}
  /
  node_filesystem_size_bytes{fstype!~"tmpfs|overlay"}
) * 100
Queue Lag#
max by (queue, consumer_group) (
  queue_consumer_lag{service="order-worker"}
)
9. Dependency Health#
Database#
histogram_quantile(
  0.95,
  sum by (le, database, operation) (
    rate(database_query_duration_seconds_bucket{service="order-api"}[5m])
  )
)

sum by (database, operation, status) (
  rate(database_queries_total{service="order-api"}[5m])
)
Cache#
sum(rate(cache_hits_total{service="order-api"}[5m]))
/
(
  sum(rate(cache_hits_total{service="order-api"}[5m]))
  +
  sum(rate(cache_misses_total{service="order-api"}[5m]))
)
External API#
histogram_quantile(
  0.95,
  sum by (le, dependency) (
    rate(external_request_duration_seconds_bucket{service="order-api"}[5m])
  )
)
10. Capacity#
What It Means#
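Capacity is not about whether something is wrong right now, but about:
  how long until problems appear at the current growth rate
  whether resources are sufficient after a failover
  whether there is still headroom at peak traffic
Disk Forecast#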
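`predict_linear` fits a straight line through the samples in the range and extrapolates it forward. The following sketch reproduces that idea with an ordinary least-squares fit on hypothetical daily readings. It predicts relative to the last sample rather than the query evaluation time, which is a simplification.

```python
def predict_linear(samples: list, horizon_seconds: float) -> float:
    """Fit value = slope * t + intercept by least squares and extrapolate.

    samples is a list of (timestamp_seconds, value) pairs; the prediction is
    for horizon_seconds after the last sample.
    """
    n = len(samples)
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    covariance = sum((t - mean_t) * (v - mean_v) for t, v in samples)
    variance = sum((t - mean_t) ** 2 for t, _ in samples)
    slope = covariance / variance
    intercept = mean_v - slope * mean_t
    return slope * (samples[-1][0] + horizon_seconds) + intercept


DAY = 24 * 3600
GIB = 1024 ** 3
# Hypothetical daily readings: 100 GiB free, shrinking by 10 GiB per day.
history = [(day * DAY, (100 - 10 * day) * GIB) for day in range(8)]
predicted = predict_linear(history, 14 * DAY)
print(f"predicted free space in 14 days: {predicted / GIB:.0f} GiB")
print("would alert:", predicted < 0)
```

A negative prediction means the trend line crosses zero free bytes before the horizon, which is the condition the `< 0` comparison in the forecast query checks.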
predict_linear(node_filesystem_avail_bytes{fstype!~"tmpfs|overlay"}[7d], 14 * 24 * 3600) < 0
Peak QPS#
max_over_time(
  sum(rate(http_requests_total{service="order-api"}[5m]))[7d:]
)
Headroom#
headroom = 1 - current_peak / tested_max_capacity
example:
  tested_max_capacity = 10000 rps
  current_peak = 6000 rps
  headroom = 40%
11. SLO / Error Budget#
99.9% Availability SLO#
SLO:
  99.9% of order-api requests succeed over 30 days
error budget:
  0.1% allowed failure
30d Availability#
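The 30-day availability computed in this subsection translates directly into error budget consumed. A minimal sketch with hypothetical request totals follows; `error_budget_remaining` is an illustrative helper, not a Prometheus function.

```python
def error_budget_remaining(total: int, failed: int, slo: float = 0.999) -> float:
    """Return the fraction of the error budget still unspent (negative if overspent)."""
    allowed_failures = total * (1 - slo)
    return 1 - failed / allowed_failures


# Hypothetical 30-day totals: 50 million requests, 20,000 of them 5xx.
remaining = error_budget_remaining(total=50_000_000, failed=20_000)
print(f"allowed failures: {50_000_000 * (1 - 0.999):.0f}")
print(f"budget remaining: {remaining:.1%}")
```

A negative result means the SLO has already been missed for the window, which is usually the signal to prioritize reliability work over new releases.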
sum(increase(http_requests_total{service="order-api", status!~"5.."}[30d]))
/
sum(increase(http_requests_total{service="order-api"}[30d]))
Error Budget Burn Rate#
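Burn rate is the observed error ratio divided by the error ratio the SLO allows, so a burn rate of 1 spends exactly one full budget over the SLO window. A small sketch with hypothetical numbers follows; it also shows how long a 30-day budget lasts at a sustained burn rate. Both helper names are illustrative.

```python
def burn_rate(error_ratio: float, slo: float = 0.999) -> float:
    """Return how many times faster than allowed the error budget is being spent."""
    return error_ratio / (1 - slo)


def days_until_exhausted(rate: float, window_days: float = 30) -> float:
    """Return how long a full budget lasts if this burn rate is sustained."""
    return window_days / rate


# Hypothetical: 1.4% of requests are failing against a 99.9% SLO.
rate = burn_rate(0.014)
print(f"burn rate: {rate:.1f}x")
print(f"budget lasts: {days_until_exhausted(rate):.2f} days")
```

At roughly 14x, a month of budget is gone in a little over two days, which is why a threshold of that size is treated as page-worthy.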
(
  sum(rate(http_requests_total{service="order-api", status=~"5.."}[5m]))
  /
  sum(rate(http_requests_total{service="order-api"}[5m]))
)
/
0.001
Fast Burn Alert#
groups:
  - name: sre-slo
    rules:
      - alert: OrderApiFastErrorBudgetBurn
        expr: |
          (
            sum(rate(http_requests_total{service="order-api", status=~"5.."}[5m]))
            /
            sum(rate(http_requests_total{service="order-api"}[5m]))
          )
          /
          0.001 > 14
        for: 5m
        labels:
          severity: critical
          service: order-api
        annotations:
          summary: "order-api is burning error budget too fast"
          description: "Error budget burn rate is higher than 14x for 5 minutes."
12. Dashboard Layout#
Top:
  availability
  p95 latency
  error rate
  request rate
Service:
  request rate by route
  error rate by route
  latency by route
  status code distribution
Dependency:
  database query latency
  database error rate
  cache hit rate
  external API latency / error rate
Worker / Queue:
  queue depth
  consumer lag
  job processing duration
  job failure rate
Infrastructure:
  CPU
  memory
  disk
  network
  container restart
SLO:
  30d availability
  error budget remaining
  burn rate
13. Final Checklist#
For every production service:
[ ] /metrics exposes RED metrics
[ ] dependency metrics are collected
[ ] queue / worker metrics are collected
[ ] node / container metrics are collected
[ ] availability PromQL exists
[ ] latency PromQL exists
[ ] error rate PromQL exists
[ ] saturation PromQL exists
[ ] SLO PromQL exists
[ ] alert rules use user-impact metrics
[ ] dashboard shows service + dependency + infrastructure
[ ] runbook links exist in alerts