Hands-on SRE With Prometheus#

Important Points#

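This example covers the core of day-to-day SRE work:
    1. How to write application metrics
    2. How to collect node / container / database / queue metrics
    3. How to compute availability / latency / traffic / errors with PromQL
    4. How to assess saturation / capacity
    5. How to implement SLO / error budget / burn rate
    6. How to write alert rules
    7. What a dashboard should show

Example system:
    order-api
        accepts user order requests
        calls payment-api
        writes to PostgreSQL
        publishes messages to Kafka / queue

    order-worker
        consumes messages from the queue
        processes asynchronous tasks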

    dependencies:
        PostgreSQL
        Redis
        Kafka / RabbitMQ
        payment-api

1. Metrics Sources#

Target          How To Collect                    Important Metrics
------------    ------------------------------    ------------------------------------------
order-api       app exposes /metrics              request count, latency, errors, in-flight
order-worker    app exposes /metrics              job count, job duration, failures, queue lag
VM / Node       node exporter                     CPU, memory, disk, network
Container       cAdvisor / kubelet                container CPU, memory, restarts
PostgreSQL      postgres exporter                 connections, query latency, locks
Redis           redis exporter                    memory, hit rate, evictions
Kafka           kafka exporter                    consumer lag, broker health
External API    app metrics / blackbox exporter   dependency latency, error rate

2. Prometheus Scrape Config#

scrape_configs:
  - job_name: order-api
    metrics_path: /metrics
    static_configs:
      - targets:
          - order-api:3000
        labels:
          service: order-api
          env: prod
          team: platform

  - job_name: order-worker
    metrics_path: /metrics
    static_configs:
      - targets:
          - order-worker:3000
        labels:
          service: order-worker
          env: prod
          team: platform

  - job_name: node-exporter
    static_configs:
      - targets:
          - node-1:9100
          - node-2:9100

  - job_name: postgres-exporter
    static_configs:
      - targets:
          - postgres-exporter:9187

  - job_name: blackbox
    metrics_path: /probe
    params:
      module:
        - http_2xx
    static_configs:
      - targets:
          - https://api.example.com/healthz
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: blackbox-exporter:9115
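
The three relabel rules in the blackbox job are easy to misread, so here is a minimal Python sketch of what they do to a target's labels. It is a simplified model, not Prometheus code: real relabeling also supports regex, separator, and action fields, and the function name `apply_blackbox_relabeling` is made up for illustration.

```python
# Simplified model of the three relabel rules in the blackbox job.
# Only the default "replace" behaviour is modelled here.

def apply_blackbox_relabeling(target: str, exporter_address: str) -> dict:
    labels = {"__address__": target}
    # Rule 1: copy the original target into the ?target= URL parameter.
    labels["__param_target"] = labels["__address__"]
    # Rule 2: keep the probed URL as the human-readable instance label.
    labels["instance"] = labels["__param_target"]
    # Rule 3: send the actual scrape request to the blackbox exporter.
    labels["__address__"] = exporter_address
    return labels


labels = apply_blackbox_relabeling(
    "https://api.example.com/healthz", "blackbox-exporter:9115"
)
for name, value in labels.items():
    print(f"{name} = {value}")
```

The net effect is that Prometheus scrapes `blackbox-exporter:9115/probe` with `module=http_2xx` and `target=https://api.example.com/healthz`, while the resulting series still carry the probed URL in `instance`.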

3. Application Metrics#

order-api should expose:

HTTP:
    http_requests_total{service, method, route, status}
    http_request_duration_seconds_bucket{service, method, route, status, le}
    http_in_flight_requests{service}

Business:
    orders_created_total{service, channel}
    payments_total{service, provider, result}
    payment_failures_total{service, provider, reason}

Dependency:
    external_requests_total{service, dependency, method, status}
    external_request_duration_seconds_bucket{service, dependency, le}
    database_queries_total{service, database, operation, status}
    database_query_duration_seconds_bucket{service, database, operation, le}
    cache_hits_total{service}
    cache_misses_total{service}

Queue (consumption and lag metrics are reported by order-worker):
    queue_messages_published_total{service, queue}
    queue_messages_consumed_total{service, queue}
    queue_message_processing_duration_seconds_bucket{service, queue, le}
    queue_depth{service, queue}
    queue_consumer_lag{service, queue, consumer_group}
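
To make the metric names above concrete, the following sketch shows roughly what a counter such as `http_requests_total` looks like when it is served on /metrics in the Prometheus text format. It uses only the Python standard library; the `Counter` class is a tiny stand-in written for this example, and the `/orders` route is an assumed value. A real service would normally use an official client library rather than hand-rolling this.

```python
from collections import defaultdict


class Counter:
    """Tiny stand-in for a client-library counter (illustration only)."""

    def __init__(self, name: str, help_text: str):
        self.name = name
        self.help_text = help_text
        self._values = defaultdict(float)

    def inc(self, **labels: str) -> None:
        # Label sets are stored in sorted order so the output is stable.
        self._values[tuple(sorted(labels.items()))] += 1

    def render(self) -> str:
        lines = [
            f"# HELP {self.name} {self.help_text}",
            f"# TYPE {self.name} counter",
        ]
        for key, value in sorted(self._values.items()):
            label_text = ",".join(f'{k}="{v}"' for k, v in key)
            lines.append(f"{self.name}{{{label_text}}} {value:g}")
        return "\n".join(lines) + "\n"


http_requests_total = Counter("http_requests_total", "Total HTTP requests.")
# "/orders" is an assumed route name used only for this example.
http_requests_total.inc(service="order-api", method="POST", route="/orders", status="201")
http_requests_total.inc(service="order-api", method="POST", route="/orders", status="201")
http_requests_total.inc(service="order-api", method="POST", route="/orders", status="500")
print(http_requests_total.render())
```

Each distinct label combination becomes its own time series, which is why labels such as `route` should hold a bounded set of values (a route template, not a raw URL with IDs).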

4. Availability#

What It Means#

availability = successful requests / total requests

Typically:
    2xx / 3xx = success
    5xx = server failure
    whether 4xx counts as failure depends on the business
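
As a worked example of the definition above, this small Python sketch computes availability from per-status request counts, with a switch for the 4xx decision. The counts are made-up numbers and the function name is hypothetical.

```python
def availability(status_counts: dict[int, int], count_4xx_as_failure: bool = False) -> float:
    """Return successful requests / total requests."""
    total = sum(status_counts.values())
    if total == 0:
        raise ValueError("no requests observed")
    failures = 0
    for status, count in status_counts.items():
        is_server_error = status >= 500
        is_client_error = 400 <= status < 500
        if is_server_error or (count_4xx_as_failure and is_client_error):
            failures += count
    return (total - failures) / total


# Hypothetical request counts by HTTP status code.
counts = {200: 9_800, 302: 50, 404: 100, 503: 50}
print(f"5xx only:      {availability(counts):.3%}")
print(f"4xx and 5xx:   {availability(counts, count_4xx_as_failure=True):.3%}")
```

With these numbers the same traffic gives 99.5% or 98.5% depending on the 4xx policy, so the policy should be fixed before an SLO target is chosen.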

PromQL#

sum(rate(http_requests_total{service="order-api", status!~"5.."}[5m]))
/
sum(rate(http_requests_total{service="order-api"}[5m]))

Alert#

groups:
  - name: sre-availability
    rules:
      - alert: OrderApiLowAvailability
        expr: |
          (
            sum(rate(http_requests_total{service="order-api", status!~"5.."}[5m]))
            /
            sum(rate(http_requests_total{service="order-api"}[5m]))
          ) < 0.99
        for: 10m
        labels:
          severity: critical
          service: order-api
        annotations:
          summary: "order-api availability is below 99%"
          description: "Successful request ratio is below 99% for 10 minutes."

5. Latency#

What It Means#

For latency, focus on percentiles:
    p50: the typical user experience
    p95: the experience of most users
    p99: long-tail problems
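
Percentiles from a Prometheus histogram are estimates: the quantile is located in a bucket and then linearly interpolated inside it. The sketch below imitates that idea for classic cumulative buckets in plain Python. It is a simplified approximation of the behaviour, not the Prometheus implementation, and the bucket counts are invented for the example.

```python
import math


def histogram_quantile(q: float, buckets: list[tuple[float, float]]) -> float:
    """Estimate a quantile from (upper_bound, cumulative_count) pairs.

    The last bucket is expected to have upper bound +Inf.
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    buckets = sorted(buckets)
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # Quantile falls in the +Inf bucket: report the highest finite bound.
                return prev_bound
            # Linear interpolation inside the bucket that contains the rank.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return math.nan


# Hypothetical cumulative bucket counts for request duration in seconds.
buckets = [(0.1, 600), (0.25, 850), (0.5, 930), (1.0, 990), (math.inf, 1000)]
print(f"p50 = {histogram_quantile(0.50, buckets):.3f}s")
print(f"p95 = {histogram_quantile(0.95, buckets):.3f}s")
```

Because the result depends on bucket boundaries, a p95 threshold such as 1s is only meaningful if there is a bucket boundary at or near 1s.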

PromQL#

histogram_quantile(
  0.95,
  sum by (le, service, route) (
    rate(http_request_duration_seconds_bucket{service="order-api"}[5m])
  )
)

Alert#

groups:
  - name: sre-latency
    rules:
      - alert: OrderApiHighP95Latency
        expr: |
          histogram_quantile(
            0.95,
            sum by (le, service, route) (
              rate(http_request_duration_seconds_bucket{service="order-api"}[5m])
            )
          ) > 1
        for: 10m
        labels:
          severity: warning
          service: order-api
        annotations:
          summary: "order-api p95 latency is high"
          description: "p95 latency is higher than 1s for 10 minutes."

6. Traffic#

What It Means#

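Traffic is used to determine:
    what the current traffic level is
    whether it has suddenly dropped
    whether it has suddenly spiked
    whether it exceeds the historical peak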

PromQL#

Request rate:

sum by (service, route, method) (
  rate(http_requests_total{service="order-api"}[5m])
)

Traffic drop:

sum(rate(http_requests_total{service="order-api"}[5m]))
<
sum(rate(http_requests_total{service="order-api"}[1h] offset 1d)) * 0.5

7. Errors#

What It Means#

For errors, watch:
    server error rate
    dependency error rate
    business failure rate

PromQL#

HTTP 5xx error rate:

sum(rate(http_requests_total{service="order-api", status=~"5.."}[5m]))
/
sum(rate(http_requests_total{service="order-api"}[5m]))

Payment failure rate:

sum(rate(payment_failures_total{service="order-api"}[5m]))
/
sum(rate(payments_total{service="order-api"}[5m]))

Dependency error rate:

sum by (dependency) (
  rate(external_requests_total{service="order-api", status=~"5.."}[5m])
)
/
sum by (dependency) (
  rate(external_requests_total{service="order-api"}[5m])
)

8. Saturation#

What It Means#

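Saturation means the system is approaching its limits:
    CPU is close to full
    memory is close to full
    disk is almost full
    connection pool is almost exhausted
    queue is backing up
    consumer lag is growing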

Node CPU#

100 -
avg by (instance) (
  rate(node_cpu_seconds_total{mode="idle"}[5m])
) * 100

Memory Usage#

(
  1 -
  node_memory_MemAvailable_bytes
  /
  node_memory_MemTotal_bytes
) * 100

Disk Usage#

(
  1 -
  node_filesystem_avail_bytes{fstype!~"tmpfs|overlay"}
  /
  node_filesystem_size_bytes{fstype!~"tmpfs|overlay"}
) * 100

Queue Lag#

max by (queue, consumer_group) (
  queue_consumer_lag{service="order-worker"}
)

9. Dependency Health#

Database#

p95 query latency:

histogram_quantile(
  0.95,
  sum by (le, database, operation) (
    rate(database_query_duration_seconds_bucket{service="order-api"}[5m])
  )
)

Query rate by status:

sum by (database, operation, status) (
  rate(database_queries_total{service="order-api"}[5m])
)

Cache#

sum(rate(cache_hits_total{service="order-api"}[5m]))
/
(
  sum(rate(cache_hits_total{service="order-api"}[5m]))
  +
  sum(rate(cache_misses_total{service="order-api"}[5m]))
)

External API#

histogram_quantile(
  0.95,
  sum by (le, dependency) (
    rate(external_request_duration_seconds_bucket{service="order-api"}[5m])
  )
)

10. Capacity#

What It Means#

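Capacity is not about whether there is a problem right now, but about:
    how long until there is a problem at the current growth rate
    whether resources are sufficient after a failover
    whether there is still headroom at peak traffic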

Disk Forecast#

predict_linear(node_filesystem_avail_bytes{fstype!~"tmpfs|overlay"}[7d], 14 * 24 * 3600) < 0
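
The forecast query fits a straight line through the last 7 days of samples and extrapolates it 14 days ahead. The Python sketch below shows the same idea with an ordinary least-squares fit. It is an approximation for intuition only: the sample data is invented, values are in GiB rather than bytes, and the extrapolation here is anchored at the last sample's timestamp.

```python
def predict_linear(samples: list[tuple[float, float]], horizon_seconds: float) -> float:
    """Least-squares line through (timestamp, value) samples, extrapolated forward."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    cov = sum((t - mean_t) * (v - mean_v) for t, v in samples)
    var = sum((t - mean_t) ** 2 for t, _ in samples)
    if var == 0:
        raise ValueError("samples must have distinct timestamps")
    slope = cov / var
    intercept = mean_v - slope * mean_t
    last_t = samples[-1][0]
    return intercept + slope * (last_t + horizon_seconds)


DAY = 24 * 3600
# Hypothetical data: free space falls from 50 GiB by 5 GiB per day for 7 days.
samples = [(day * DAY, 50.0 - 5.0 * day) for day in range(8)]
predicted = predict_linear(samples, 14 * DAY)
print(f"predicted free space in 14 days: {predicted:.1f} GiB")
print("alert would fire" if predicted < 0 else "no alert")
```

A negative prediction means the disk is expected to fill before the horizon, which is the condition the `< 0` comparison in the query checks.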

Peak QPS#

max_over_time(
  sum(rate(http_requests_total{service="order-api"}[5m]))[7d:]
)

Headroom#

headroom = 1 - current_peak / tested_max_capacity

example:
    tested_max_capacity = 10000 rps
    current_peak = 6000 rps
    headroom = 40%
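
The headroom formula, plus the failover question raised under capacity, can be sketched in a few lines of Python. The failover variant assumes capacity scales linearly with the number of healthy nodes, which is a simplification, and the node count of 3 is an assumed value for illustration.

```python
def headroom(current_peak_rps: float, tested_max_rps: float) -> float:
    """headroom = 1 - current_peak / tested_max_capacity"""
    if tested_max_rps <= 0:
        raise ValueError("tested_max_rps must be positive")
    return 1 - current_peak_rps / tested_max_rps


def headroom_after_failover(
    current_peak_rps: float, tested_max_rps: float, nodes: int, failed: int = 1
) -> float:
    # Assumption: capacity is shared evenly, so losing nodes shrinks it proportionally.
    if failed >= nodes:
        raise ValueError("at least one node must survive")
    remaining_capacity = tested_max_rps * (nodes - failed) / nodes
    return headroom(current_peak_rps, remaining_capacity)


print(f"normal:         {headroom(6000, 10000):.0%}")
print(f"one of 3 down:  {headroom_after_failover(6000, 10000, nodes=3):.0%}")
```

Under these assumptions, 40% headroom shrinks to about 10% when one of three nodes is lost, so headroom should be judged against the failover case and not only the healthy one.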

11. SLO / Error Budget#

99.9% Availability SLO#

SLO:
    99.9% of order-api requests succeed over 30 days

error budget:
    0.1% allowed failure
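
To put the 0.1% budget in concrete terms, this sketch converts it into allowed failed requests, equivalent minutes of full outage, and budget remaining. The request volumes are hypothetical and the function names are made up for the example.

```python
def allowed_failures(slo: float, total_requests: int) -> float:
    """Number of requests that may fail without breaking the SLO."""
    return (1 - slo) * total_requests


def allowed_downtime_minutes(slo: float, window_days: int = 30) -> float:
    """Equivalent minutes of total outage, assuming evenly spread traffic."""
    return (1 - slo) * window_days * 24 * 60


def budget_remaining(slo: float, total_requests: int, failed_requests: int) -> float:
    """Fraction of the error budget still unspent (negative if overspent)."""
    return 1 - failed_requests / allowed_failures(slo, total_requests)


SLO = 0.999
# Hypothetical 30-day volume: 10 million requests, 2,500 of them failed.
print(f"allowed failures:   {allowed_failures(SLO, 10_000_000):,.0f}")
print(f"allowed downtime:   {allowed_downtime_minutes(SLO):.1f} min")
print(f"budget remaining:   {budget_remaining(SLO, 10_000_000, 2_500):.0%}")
```

A 99.9% SLO over 30 days therefore tolerates roughly 43 minutes of full outage, or about 10,000 failed requests out of 10 million.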

30d Availability#

sum(increase(http_requests_total{service="order-api", status!~"5.."}[30d]))
/
sum(increase(http_requests_total{service="order-api"}[30d]))

Error Budget Burn Rate#

(
  sum(rate(http_requests_total{service="order-api", status=~"5.."}[5m]))
  /
  sum(rate(http_requests_total{service="order-api"}[5m]))
)
/
0.001
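
The burn rate is the observed error ratio divided by the allowed error ratio (0.001 for a 99.9% SLO). A burn rate of 1 spends the budget exactly over the SLO window, so a higher rate shortens the time to exhaustion proportionally. The following sketch works through that arithmetic; the error ratios are example values.

```python
def burn_rate(error_ratio: float, slo: float) -> float:
    """Observed error ratio divided by the error budget (1 - SLO)."""
    return error_ratio / (1 - slo)


def days_to_exhaustion(rate: float, window_days: float = 30) -> float:
    """Days until the whole budget is gone if the burn rate stays constant."""
    if rate <= 0:
        return float("inf")
    return window_days / rate


SLO = 0.999
for error_ratio in (0.001, 0.005, 0.014):
    rate = burn_rate(error_ratio, SLO)
    print(
        f"error ratio {error_ratio:.1%} -> burn rate {rate:.1f}x, "
        f"budget exhausted in {days_to_exhaustion(rate):.1f} days"
    )
```

At a sustained 14x burn rate the 30-day budget lasts only about two days, which is why that level is usually treated as page-worthy.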

Fast Burn Alert#

groups:
  - name: sre-slo
    rules:
      - alert: OrderApiFastErrorBudgetBurn
        expr: |
          (
            sum(rate(http_requests_total{service="order-api", status=~"5.."}[5m]))
            /
            sum(rate(http_requests_total{service="order-api"}[5m]))
          )
          /
          0.001 > 14
        for: 5m
        labels:
          severity: critical
          service: order-api
        annotations:
          summary: "order-api is burning error budget too fast"
          description: "Error budget burn rate is higher than 14x for 5 minutes."

12. Dashboard Layout#

Top:
    availability
    p95 latency
    error rate
    request rate

Service:
    request rate by route
    error rate by route
    latency by route
    status code distribution

Dependency:
    database query latency
    database error rate
    cache hit rate
    external API latency / error rate

Worker / Queue:
    queue depth
    consumer lag
    job processing duration
    job failure rate

Infrastructure:
    CPU
    memory
    disk
    network
    container restart

SLO:
    30d availability
    error budget remaining
    burn rate

13. Final Checklist#

For every production service:
    [ ] /metrics exposes RED metrics
    [ ] dependency metrics are collected
    [ ] queue / worker metrics are collected
    [ ] node / container metrics are collected
    [ ] availability PromQL exists
    [ ] latency PromQL exists
    [ ] error rate PromQL exists
    [ ] saturation PromQL exists
    [ ] SLO PromQL exists
    [ ] alert rules use user-impact metrics
    [ ] dashboard shows service + dependency + infrastructure
    [ ] runbook links exist in alerts