1. Principle#
when to add custom metrics#
Good candidates for custom metrics:
Business-critical paths
order created / payment success / checkout duration
SLI / SLO metrics
request count / error count / request latency
Dependency health
database query duration / cache hit ratio / queue lag
Asynchronous work
job duration / job failures / pending tasks
Anything that drives alerting or capacity planning
queue depth / active users / storage usage
Poor candidates for custom metrics:
Per-request details that only help debug a single request
request_id / trace_id / user_id / order_id
Details already available in logs or traces
raw payload / exception stack / SQL statement
Metrics with no clear use case
nobody knows how to query it, alert on it, or put it on a dashboard
metric name#
A metric name says what is measured, not where it comes from
❌ aws_s3_usage_bytes
✅ storage_usage_bytes{service="s3"}
Do not put dynamic values in a metric name
❌ user_123456_login_total
✅ user_login_total{result="success"}
Do not put environment / region / instance in a metric name
❌ prod_hk_api_requests_total
✅ api_requests_total{env="prod", region="hk"}
A metric name should carry an explicit unit
❌ request_duration
✅ request_duration_seconds
A metric name should describe a business action or a resource state where possible
request / error / retry / queue / connection / latency / duration / size / usage
standard naming convention:
all lowercase, words joined by underscores
<namespace>_<resource>_<unit>_<suffix>
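The convention can be checked mechanically before a metric is merged. The sketch below is a hypothetical linter, not part of any client library: `lintMetricName`, `KNOWN_UNITS`, and the exact messages are assumptions made for illustration, and the per-type suffix checks follow the suffix rules this guide lists for Counter, Gauge, and Histogram.

```javascript
// Hypothetical linter for the naming convention; the unit list is illustrative.
const KNOWN_UNITS = ["seconds", "bytes", "ratio"];

function lintMetricName(name, type) {
  const problems = [];
  // Lowercase letters, digits and underscores only, starting with a letter.
  if (!/^[a-z][a-z0-9_]*$/.test(name)) {
    problems.push("use lowercase letters, digits and underscores only");
  }
  const parts = name.split("_");
  const last = parts[parts.length - 1];
  if (type === "counter" && last !== "total") {
    problems.push("counter names should end with _total");
  }
  if (type === "gauge" && last === "total") {
    problems.push("gauge names usually do not end with _total");
  }
  if (type === "histogram" && !KNOWN_UNITS.includes(last)) {
    problems.push("histogram names should end with a unit such as _seconds");
  }
  return problems;
}

console.log(lintMetricName("http_request_duration_seconds", "histogram")); // no problems
console.log(lintMetricName("request_duration", "histogram"));              // missing unit
console.log(lintMetricName("Prod-API-Requests", "counter"));               // bad characters, no _total
```

A check like this fits naturally in a unit test or CI step, so naming problems surface before the metric reaches a dashboard.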
examples:
http_request_duration_seconds
queue_message_processing_duration_seconds
database_query_duration_seconds
storage_usage_bytes
orders_created_total
suffix:
Counter (only increases): ends with _total
api_requests_total
errors_total
orders_created_total
Gauge (goes up and down): usually no _total
memory_usage_bytes
queue_depth
active_users
Histogram: must include a unit
request_duration_seconds
database_query_duration_seconds
queue_message_processing_duration_seconds
enterprise examples:
HTTP / API
http_requests_total
http_request_duration_seconds
http_request_size_bytes
http_response_size_bytes
http_in_flight_requests
RPC / gRPC
rpc_requests_total
rpc_request_duration_seconds
grpc_server_handled_total
grpc_server_handling_seconds
Database
database_connections
database_queries_total
database_query_duration_seconds
database_transaction_duration_seconds
database_deadlocks_total
Cache
cache_hits_total
cache_misses_total
cache_evictions_total
cache_operation_duration_seconds
cache_memory_usage_bytes
Queue / Message
queue_depth
queue_messages_published_total
queue_messages_consumed_total
queue_message_processing_duration_seconds
queue_consumer_lag
Background Job / Cron
job_runs_total
job_failures_total
job_duration_seconds
job_last_success_timestamp_seconds
job_pending_tasks
Storage / File
storage_usage_bytes
storage_objects
storage_read_bytes_total
storage_write_bytes_total
file_uploads_total
file_upload_duration_seconds
Business
orders_created_total
payments_total
payment_failures_total
checkout_duration_seconds
active_users
sessions_total
External Dependency
external_requests_total
external_request_duration_seconds
external_errors_total
external_retries_total
external_circuit_breaker_open
bad vs good:
❌ api_latency
✅ http_request_duration_seconds{service="order-api"}
❌ redis_cache_count
✅ cache_hits_total{backend="redis"}
❌ kafka_lag_user_topic
✅ queue_consumer_lag{broker="kafka", topic="user", consumer_group="syncer"}
❌ mysql_slow_query
✅ database_query_duration_seconds{database="mysql"}
❌ payment_error
✅ payment_failures_total{reason="insufficient_balance"}
❌ cpu_usage
✅ process_cpu_seconds_total / node_cpu_seconds_total
metric type#
Counter:
Only increases; it may start again from 0 after a process restart
Query it with rate() / increase()
examples:
http_requests_total
payment_failures_total
queue_messages_consumed_total
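Because a counter may restart from 0, the raw value alone is not meaningful; what matters is how much it grew over a window. The sketch below is a simplified model of that idea, with hypothetical sample values. It only handles counter resets; real `rate()` / `increase()` also extrapolate to the window boundaries, which this sketch does not attempt.

```javascript
// Simplified model of how a counter's growth is computed across a restart.
function counterIncrease(samples) {
  let total = 0;
  for (let i = 1; i < samples.length; i++) {
    const prev = samples[i - 1];
    const curr = samples[i];
    // A drop means the process restarted and the counter began again from 0,
    // so everything counted since the reset is new.
    total += curr >= prev ? curr - prev : curr;
  }
  return total;
}

// Hypothetical scrapes of http_requests_total, one every 15 s,
// with a restart between the third and fourth scrape.
const samples = [100, 130, 160, 10, 40];
const increase = counterIncrease(samples);        // 30 + 30 + 10 + 30 = 100
const windowSeconds = (samples.length - 1) * 15;  // 60 seconds
console.log(increase, increase / windowSeconds);  // growth and per-second rate
```

This is why a counter is graphed through `rate()` or `increase()` rather than plotted directly: the reset shows up as a sawtooth in the raw series but disappears in the derived one.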
Gauge:
A current value that can go up or down
Query the current value directly, or aggregate it with avg / max / min
examples:
queue_depth
active_users
database_connections
Histogram:
Captures the distribution of a set of observations; a good fit for durations and sizes
Exposes _bucket / _sum / _count series
Use histogram_quantile to compute P90 / P95 / P99
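To make the `_bucket` / `_sum` / `_count` series concrete, here is a minimal hand-rolled histogram. It is an illustration only, not a client library: the class name `SketchHistogram`, the bucket bounds, and the observed values are all assumptions. The point it shows is that buckets are cumulative, so each `le` bucket counts every observation at or below its bound.

```javascript
// Minimal hand-rolled histogram, for illustration only.
class SketchHistogram {
  constructor(name, upperBounds) {
    this.name = name;
    this.upperBounds = [...upperBounds].sort((a, b) => a - b);
    this.bucketCounts = new Array(this.upperBounds.length).fill(0);
    this.sum = 0;
    this.count = 0;
  }

  observe(value) {
    // Buckets are cumulative: a value is counted in every bucket whose
    // upper bound (le) is greater than or equal to it.
    this.upperBounds.forEach((le, i) => {
      if (value <= le) this.bucketCounts[i] += 1;
    });
    this.sum += value;
    this.count += 1;
  }

  // Render the series in the text exposition style.
  render() {
    const lines = this.upperBounds.map(
      (le, i) => `${this.name}_bucket{le="${le}"} ${this.bucketCounts[i]}`
    );
    lines.push(`${this.name}_bucket{le="+Inf"} ${this.count}`);
    lines.push(`${this.name}_sum ${this.sum}`);
    lines.push(`${this.name}_count ${this.count}`);
    return lines.join("\n");
  }
}

const h = new SketchHistogram("http_request_duration_seconds", [0.25, 0.5, 1]);
[0.125, 0.5, 0.75, 2].forEach((v) => h.observe(v));
console.log(h.render());
```

With these four observations the `le="0.5"` bucket holds 2, the `le="1"` bucket holds 3, and the `+Inf` bucket always equals `_count`. Bucket bounds should bracket the latencies you care about, since quantiles can only be estimated within a bucket.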
examples:
http_request_duration_seconds
database_query_duration_seconds
file_upload_duration_seconds
labels#
A label expresses a dimension, used for filtering, grouping, and aggregation
standard labels:
service / app
which service the series belongs to
http_requests_total{service="payment-api"}
job
Prometheus scrape job name
up{job="node-exporter"}
instance
the scraped target instance, usually host:port
up{instance="10.0.1.12:9100"}
env / environment
runtime environment
requests_total{env="prod"}
region / zone
region / availability zone
requests_total{region="ap-east-1", zone="ap-east-1a"}
cluster
cluster name in Kubernetes / VM / multi-cluster setups
requests_total{cluster="prod-hk"}
namespace
Kubernetes namespace
container_cpu_usage_seconds_total{namespace="monitoring"}
pod
Kubernetes pod name
container_memory_usage_bytes{pod="prometheus-server-xxx"}
container
container name
container_cpu_usage_seconds_total{container="app"}
method
HTTP method
http_requests_total{method="GET"}
route / path
HTTP route; use the templated route, not the raw path
✅ http_requests_total{route="/api/users/:id"}
❌ http_requests_total{path="/api/users/123456"}
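One way to keep the `route` label bounded is to collapse identifiers in the raw path before using it as a label value. The helper below is a hypothetical sketch: `toRouteLabel` and its two patterns (numeric IDs and UUIDs) are assumptions. In a real service it is usually better to take the route pattern the router has already matched.

```javascript
// Hypothetical helper: collapse raw paths into a bounded set of route templates.
const UUID = /^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$/i;

function toRouteLabel(path) {
  return path
    .split("?")[0] // drop the query string
    .split("/")
    .map((segment) => {
      if (/^\d+$/.test(segment)) return ":id"; // numeric identifier
      if (UUID.test(segment)) return ":uuid";  // UUID identifier
      return segment;
    })
    .join("/");
}

console.log(toRouteLabel("/api/users/123456"));           // /api/users/:id
console.log(toRouteLabel("/api/users/987?verbose=true")); // /api/users/:id
```

Both requests now land in the same series, so the number of `route` values grows with the number of endpoints rather than the number of users.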
status / code
HTTP status code
http_requests_total{status="200"}
handler
处理请求的 handler / controller / function
http_request_duration_seconds{handler="CreateUser"}
error / reason
error type or failure reason
job_failures_total{reason="timeout"}
version
application version
build_info{service="payment-api", version="1.2.3"}
label rules:
Label values must come from a bounded set
✅ method="GET" / status="500" / env="prod"
❌ user_id="123456" / request_id="abc" / email="a@x.com"
Do not put high-cardinality fields in labels
user_id / order_id / trace_id / session_id / ip / full_url
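The reason bounded values matter is that every distinct combination of label values becomes its own time series. The sketch below illustrates both halves of the rule under stated assumptions: `findUnsafeLabels`, `maxSeries`, and the deny-list are hypothetical names, and the deny-list simply mirrors the fields named above.

```javascript
// Hypothetical guard; the deny-list mirrors the high-cardinality fields above.
const HIGH_CARDINALITY_LABELS = new Set([
  "user_id", "order_id", "trace_id", "session_id",
  "request_id", "ip", "full_url", "email",
]);

// Return the label names that should not be attached to a metric.
function findUnsafeLabels(labels) {
  return Object.keys(labels).filter((name) => HIGH_CARDINALITY_LABELS.has(name));
}

// Upper bound on series for one metric: the product of the number of
// possible values of each label.
function maxSeries(allowedValues) {
  return Object.values(allowedValues).reduce((n, values) => n * values.length, 1);
}

console.log(findUnsafeLabels({ method: "GET", user_id: "123456" })); // [ 'user_id' ]
console.log(maxSeries({
  method: ["GET", "POST", "PUT", "DELETE"],
  status: ["200", "400", "404", "500"],
  env: ["prod", "staging"],
})); // 4 * 4 * 2 = 32
```

Three bounded labels stay at 32 series at most; adding `user_id` would multiply that figure by the number of users.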
Dimensions you aggregate over go in labels
requests_total{service="api", method="GET", status="200"}
Meaning that cannot be aggregated goes in the metric name
✅ queue_depth{queue="email"}
❌ queue{type="depth", queue="email"}
final checklist#
Before adding a metric:
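[ ] Will a dashboard / alert / SLO use this metric?
[ ] Does the metric name say what is measured?
[ ] Does it carry an explicit unit or suffix: seconds / bytes / total?
[ ] Does every Counter end with _total?
[ ] Does every Histogram have sensible buckets?
[ ] Does every label take values from a bounded set?
[ ] Does it avoid user_id / request_id / trace_id / full_url?
[ ] Can it be aggregated by service / env / region?
[ ] Can it be verified directly with PromQL?
2. Hands-on: Exposing Metrics from Node.js#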
demo path:
demo/prometheus-custom-metrics-nodejs
run:
cd demo/prometheus-custom-metrics-nodejs
npm install
npm start
test:
curl http://localhost:3000/orders
curl http://localhost:3000/payments
curl http://localhost:3000/metrics
demo exposes:
process / nodejs default metrics:
process_cpu_seconds_total
process_resident_memory_bytes
nodejs_eventloop_lag_seconds
custom metrics:
http_requests_total
http_request_duration_seconds
orders_created_total
payments_total
payment_failures_total
queue_depth
3. Prometheus Scrape Config#
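scrape_configs:
  - job_name: nodejs-custom-metrics
    metrics_path: /metrics
    scrape_interval: 15s
    static_configs:
      - targets:
          - localhost:3000
Note:
service / env / region can be exposed by the application itself or added in the scrape config.
Do not set the same label name in both places, to avoid a label conflict (by default Prometheus keeps the target label and renames the scraped one to exported_<label>).
verify: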
up{job="nodejs-custom-metrics"}
http_requests_total{service="order-api"}
rate(http_requests_total{service="order-api"}[5m])
4. vmagent Scrape Config#
vmagent reads a Prometheus-format scrape config and sends the scraped samples to VictoriaMetrics via remote write.
global:
  scrape_interval: 15s
scrape_configs:
  - job_name: nodejs-custom-metrics
    metrics_path: /metrics
    static_configs:
      - targets:
          - localhost:3000
start example:
vmagent \
  -promscrape.config=prometheus.yml \
  -remoteWrite.url=http://localhost:8428/api/v1/write
5. PromQL Examples#
# request rate
sum by (service, method, route, status) (
  rate(http_requests_total[5m])
)

# error rate
sum(rate(http_requests_total{status=~"5.."}[5m]))
/
sum(rate(http_requests_total[5m]))

# p95 latency
histogram_quantile(
  0.95,
  sum by (le, service, route, method) (
    rate(http_request_duration_seconds_bucket[5m])
  )
)

# payment failure rate
sum(rate(payment_failures_total[5m]))
/
sum(rate(payments_total[5m]))

# queue depth
queue_depth{service="order-api"}
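To see what the p95 latency query does with the `_bucket` series, the sketch below reproduces the linear interpolation behind `histogram_quantile` in simplified form. The function name `estimateQuantile` and the bucket values are hypothetical, and the sketch assumes `0 < q <= 1`, cumulative buckets sorted by upper bound, at least one finite bucket, and a final `+Inf` bucket.

```javascript
// Simplified version of the linear interpolation behind histogram_quantile.
// Input: cumulative buckets sorted by upper bound, ending with le = Infinity.
function estimateQuantile(q, buckets) {
  const total = buckets[buckets.length - 1].count;
  if (total === 0) return NaN;
  const rank = q * total;
  // First bucket whose cumulative count reaches the requested rank.
  const i = buckets.findIndex((b) => b.count >= rank);
  // A quantile that falls in the +Inf bucket is reported as the highest finite bound.
  if (buckets[i].le === Infinity) return buckets[buckets.length - 2].le;
  const lowerBound = i === 0 ? 0 : buckets[i - 1].le;
  const countBelow = i === 0 ? 0 : buckets[i - 1].count;
  const inBucket = buckets[i].count - countBelow;
  // Assume observations are spread evenly inside the bucket.
  return lowerBound + (buckets[i].le - lowerBound) * ((rank - countBelow) / inBucket);
}

// Hypothetical bucket rates for one route.
const buckets = [
  { le: 0.1, count: 60 },
  { le: 0.5, count: 90 },
  { le: 1, count: 100 },
  { le: Infinity, count: 100 },
];
console.log(estimateQuantile(0.95, buckets)); // about 0.75 seconds
```

Rank 95 falls in the 0.5 to 1 bucket, halfway between cumulative counts 90 and 100, so the estimate is halfway between the bounds. The result is only as precise as the bucket layout, which is why the checklist asks for sensible buckets.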