Links#
https://prometheus.io/docs/instrumenting/exporters/
https://github.com/prometheus/node_exporter
https://github.com/ncabatoff/process-exporter
https://github.com/prometheus/blackbox_exporter
https://github.com/prometheus-community/yet-another-cloudwatch-exporter
https://github.com/prometheus-community/postgres_exporter
https://github.com/oliver006/redis_exporter
https://github.com/prometheus/mysqld_exporter
https://github.com/percona/mongodb_exporter
https://github.com/nginx/nginx-prometheus-exporter
https://github.com/prometheus/jmx_exporter
https://prometheus.io/docs/guides/cadvisor/
1. Important Points#
An exporter is a metrics adapter:
read from OS / app / database / cloud API
expose /metrics in Prometheus format
Prometheus / vmagent / VictoriaMetrics scrape it
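The adapter idea above can be sketched in a few lines of Python using only the standard library. This is an illustration, not how any of the listed exporters is implemented: the metric name `demo_load1`, the helper names, and port 9999 are all hypothetical.

```python
import os
from http.server import BaseHTTPRequestHandler, HTTPServer


def read_load_average():
    """Read the 1-minute load average from the OS (the 'source' side)."""
    return os.getloadavg()[0]


def render_metrics(load1):
    """Render one gauge in the Prometheus text exposition format."""
    return (
        "# HELP demo_load1 1-minute load average.\n"
        "# TYPE demo_load1 gauge\n"
        f"demo_load1 {load1}\n"
    )


class MetricsHandler(BaseHTTPRequestHandler):
    """Serve the rendered text on /metrics and nothing else."""

    def do_GET(self):
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = render_metrics(read_load_average()).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(host="127.0.0.1", port=9999):
    """Start the HTTP server; the default host is localhost only (hypothetical port)."""
    HTTPServer((host, port), MetricsHandler).serve_forever()
```

Calling `serve()` and then running `curl -s http://127.0.0.1:9999/metrics` should return the three lines produced by `render_metrics`. A real exporter would normally use an official client library rather than hand-written formatting.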
Core principles:
exporter should run close to target
exporter endpoint should not be public internet
exporter credentials must be least privilege
do not export high-cardinality labels
do not enable all collectors blindly
scrape interval usually starts from 30s or 60s
AWS CloudWatch exporter usually starts from 300s to reduce cost
every exporter must have an up{} alert
Do not treat an exporter as a logging system:
metrics:
counters
gauges
histograms
low cardinality labels
not metrics:
raw logs
request body
user id
token
SQL text
full URL with query string
2. Exporter Selection#
| Scope | Exporter | Port | Use For | Notes |
|---|---|---|---|---|
| Linux VM / EC2 | node_exporter | 9100 | CPU, memory, disk, network, filesystem | every VM should have it |
| Process | process-exporter | 9256 | named process count, CPU, memory | useful for Grafana / VictoriaMetrics / Alertmanager process checks |
| HTTP / TCP / DNS / TLS | blackbox_exporter | 9115 | external user-view availability | use for ALB, CloudFront, API health, TLS expiry |
| AWS CloudWatch | YACE | 5000 | RDS, ECS, ALB, SQS, DynamoDB, CloudFront metrics | cost depends on CloudWatch API calls |
| Container on VM | cAdvisor | 8080 | Docker/container CPU, memory, network | useful on plain Docker host, not always needed on ECS/Fargate |
| PostgreSQL | postgres_exporter | 9187 | connections, locks, tx, replication, table stats | use read-only user with pg_monitor |
| Valkey / Redis | redis_exporter | 9121 | memory, connected clients, evictions, hit rate | avoid expensive key scanning in prod |
| MySQL / MariaDB | mysqld_exporter | 9104 | connections, QPS, InnoDB, replication | use dedicated low-privilege user |
| MongoDB | mongodb_exporter | 9216 | replication, sharding, storage, op counters | use clusterMonitor role |
| NGINX | nginx-prometheus-exporter | 9113 | active connections, requests, NGINX Plus metrics | open stub_status only to localhost/exporter |
| JVM / Kafka / Cassandra | jmx_exporter | custom | JMX MBeans to Prometheus | prefer javaagent mode |
| Kubernetes object state | kube-state-metrics | 8080 | pod/deployment/node object status | for Kubernetes, not ECS |
| Windows VM | windows_exporter | 9182 | Windows CPU, memory, disk, services | only if Windows hosts exist |
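As a rough illustration of how the table feeds into scrape targets, the default ports above can be kept in a lookup and joined with a host name. This is only a sketch; `DEFAULT_PORTS` and `scrape_target` are hypothetical names, and the ports are the defaults from the table, which a given deployment may override.

```python
# Default listen ports copied from the selection table above.
# Hypothetical helper; real deployments may run exporters on other ports.
DEFAULT_PORTS = {
    "node_exporter": 9100,
    "process-exporter": 9256,
    "blackbox_exporter": 9115,
    "yace": 5000,
    "cadvisor": 8080,
    "postgres_exporter": 9187,
    "redis_exporter": 9121,
    "mysqld_exporter": 9104,
    "mongodb_exporter": 9216,
    "nginx-prometheus-exporter": 9113,
    "windows_exporter": 9182,
}


def scrape_target(host, exporter):
    """Return the host:port string used under static_configs.targets."""
    return f"{host}:{DEFAULT_PORTS[exporter]}"
```

For example, `scrape_target("monitoring-ec2", "node_exporter")` yields the same `monitoring-ec2:9100` string used in a static scrape config. jmx_exporter is left out because its port is chosen per application.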
Minimal production set:
EC2 monitoring host:
node_exporter
process-exporter
blackbox_exporter
AWS managed services:
YACE
blackbox_exporter for user-view probes
self-managed database/cache:
postgres_exporter / redis_exporter / mysqld_exporter / mongodb_exporter
Java middleware:
jmx_exporter
3. Common Scrape Config#
Example vmagent / Prometheus scrape config:
global:
  scrape_interval: 30s
  scrape_timeout: 10s
scrape_configs:
  - job_name: node
    static_configs:
      - targets:
          - monitoring-ec2:9100
          - app-ec2-1:9100
  - job_name: process
    static_configs:
      - targets:
          - monitoring-ec2:9256
  - job_name: yace
    scrape_interval: 300s
    scrape_timeout: 60s
    static_configs:
      - targets:
          - monitoring-ec2:5000
  - job_name: blackbox-http
    metrics_path: /probe
    params:
      module: [http_2xx]
    static_configs:
      - targets:
          - https://www.example.com/health
          - https://api.example.com/health
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: monitoring-ec2:9115
Every exporter should have a baseline alert:
groups:
  - name: exporter.rules
    rules:
      - alert: ExporterDown
        expr: up == 0
        for: 3m
        labels:
          severity: P1
          component: exporter
        annotations:
          summary: "Exporter target is down: {{ $labels.job }} {{ $labels.instance }}"
4. node_exporter#
when to use#
use for:
EC2
Linux VM
self-managed monitoring host
database host
cache host
do not use for:
managed AWS service itself
Fargate task OS metrics
best practices#
bind to private network or localhost with reverse proxy
exclude container overlay filesystems
exclude tmpfs/dev/proc/sys mountpoints
do not enable expensive collectors unless needed
use textfile collector for custom host-level metrics
alert on disk, memory, CPU, and instance down
systemd hands-on#
# /etc/systemd/system/node_exporter.service
[Unit]
Description=Prometheus Node Exporter
After=network-online.target
[Service]
User=node_exporter
Group=node_exporter
Type=simple
ExecStart=/usr/local/bin/node_exporter \
  --web.listen-address=:9100 \
  --collector.filesystem.mount-points-exclude=^/(dev|proc|sys|run/credentials/.+|var/lib/docker/.+|var/lib/containers/storage/.+|var/lib/kubelet/.+)($|/) \
  --collector.filesystem.fs-types-exclude=^(autofs|binfmt_misc|bpf|cgroup2?|configfs|debugfs|devpts|devtmpfs|fusectl|hugetlbfs|mqueue|nsfs|overlay|proc|procfs|pstore|rpc_pipefs|securityfs|selinuxfs|squashfs|sysfs|tracefs)$
Restart=always
RestartSec=5
[Install]
WantedBy=multi-user.target

sudo useradd --system --no-create-home --shell /usr/sbin/nologin node_exporter
sudo systemctl daemon-reload
sudo systemctl enable --now node_exporter
curl -s http://127.0.0.1:9100/metrics | head
key PromQL#
# host down
up{job="node"} == 0
# disk usage percent
100 * (1 - node_filesystem_avail_bytes{fstype!~"tmpfs|overlay"} / node_filesystem_size_bytes{fstype!~"tmpfs|overlay"})
# memory usage percent
100 * (1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)
# CPU usage percent
100 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100
5. process-exporter#
when to use#
use for:
verify Grafana / Alertmanager / VictoriaMetrics process exists
monitor legacy VM app process
track process CPU / memory by logical name
do not use for:
every short-lived process
per-user process inventory
config hands-on#
# /etc/process-exporter/config.yml
process_names:
  - name: "victoriametrics"
    exe:
      - /usr/local/bin/victoria-metrics
  - name: "alertmanager"
    exe:
      - /usr/local/bin/alertmanager
  - name: "grafana"
    cmdline:
      - "grafana-server"
  - name: "spug-gunicorn"
    exe:
      - /usr/bin/python3
    cmdline:
      - "gunicorn"
      - "spug.wsgi"
Choosing between comm / exe / cmdline:
ps -eo pid,comm,args
comm:
from /proc/<pid>/comm
short process name
exe:
executable path
stable when binary path is fixed
cmdline:
command line args
useful for Python / Java / Node.js apps
systemd hands-on#
# /etc/systemd/system/process-exporter.service
[Unit]
Description=Prometheus Process Exporter
After=network-online.target
[Service]
User=process_exporter
Group=process_exporter
ExecStart=/usr/local/bin/process-exporter \
  --web.listen-address=:9256 \
  --config.path=/etc/process-exporter/config.yml
Restart=always
RestartSec=5
[Install]
WantedBy=multi-user.target

sudo useradd --system --no-create-home --shell /usr/sbin/nologin process_exporter
sudo install -d -m 0755 /etc/process-exporter
sudo systemctl daemon-reload
sudo systemctl enable --now process-exporter
curl -s http://127.0.0.1:9256/metrics | grep namedprocess_namegroup_num_procs
key PromQL#
# named process missing
namedprocess_namegroup_num_procs{groupname="victoriametrics"} == 0
# process memory usage
namedprocess_namegroup_memory_bytes{groupname="victoriametrics",memtype="resident"}
6. blackbox_exporter#
when to use#
use for:
public website
API health endpoint
ALB DNS
CloudFront domain
TCP port
DNS resolution
TLS certificate expiry
blackbox_exporter answers:
can user reach this endpoint?
is DNS working?
is TLS valid?
is response status expected?
config hands-on#
# /etc/blackbox_exporter/blackbox.yml
modules:
  http_2xx:
    prober: http
    timeout: 5s
    http:
      method: GET
      preferred_ip_protocol: ip4
      valid_http_versions:
        - HTTP/1.1
        - HTTP/2.0
      valid_status_codes:
        - 200
        - 204
      fail_if_ssl: false
      fail_if_not_ssl: true
  tcp_connect:
    prober: tcp
    timeout: 5s
  dns_lookup:
    prober: dns
    timeout: 5s
    dns:
      query_name: example.com
      query_type: A

# docker-compose.yml
services:
  blackbox_exporter:
    image: quay.io/prometheus/blackbox-exporter:latest
    container_name: blackbox_exporter
    restart: unless-stopped
    ports:
      - "9115:9115"
    volumes:
      - ./blackbox.yml:/config/blackbox.yml:ro
    command:
      - "--config.file=/config/blackbox.yml"
scrape config#
scrape_configs:
  - job_name: blackbox-http
    metrics_path: /probe
    params:
      module: [http_2xx]
    static_configs:
      - targets:
          - https://www.example.com/health
          - https://api.example.com/health
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: monitoring-ec2:9115
key PromQL#
# endpoint unavailable
probe_success{job="blackbox-http"} == 0
# HTTP probe slower than 2s (probe_duration_seconds is a per-probe gauge, so this is a threshold, not a p95)
probe_duration_seconds{job="blackbox-http"} > 2
# TLS certificate expires in less than 14 days
(probe_ssl_earliest_cert_expiry - time()) / 86400 < 14
7. YACE#
when to use#
use for:
AWS/RDS
AWS/ECS
AWS/ApplicationELB
AWS/SQS
AWS/DynamoDB
AWS/ElastiCache
AWS/CloudFront
do not use for:
application custom metrics
per-request high-cardinality data
YACE cost notes:
CloudWatch API calls cost money
period should usually start at 300s
length should usually be 600s or 900s
only scrape metrics used by alerts or dashboards
tag discovery is useful, but keep the resulting labels under control
IAM policy minimal sample#
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "cloudwatch:GetMetricData",
        "cloudwatch:GetMetricStatistics",
        "cloudwatch:ListMetrics",
        "tag:GetResources"
      ],
      "Resource": "*"
    }
  ]
}
config hands-on#
# /etc/yace/config.yml
apiVersion: v1alpha1
sts-region: ap-east-1
discovery:
  jobs:
    - type: AWS/RDS
      regions:
        - ap-east-1
      period: 300
      length: 600
      nilToZero: true
      metrics:
        - name: CPUUtilization
          statistics: [Average]
        - name: FreeableMemory
          statistics: [Average]
        - name: DatabaseConnections
          statistics: [Average]
        - name: ReadLatency
          statistics: [Average]
        - name: WriteLatency
          statistics: [Average]
    - type: AWS/SQS
      regions:
        - ap-east-1
      period: 300
      length: 600
      nilToZero: true
      metrics:
        - name: ApproximateAgeOfOldestMessage
          statistics: [Maximum]
        - name: ApproximateNumberOfMessagesVisible
          statistics: [Average]
        - name: NumberOfMessagesDeleted
          statistics: [Sum]
    - type: AWS/ApplicationELB
      regions:
        - ap-east-1
      period: 300
      length: 600
      nilToZero: true
      metrics:
        - name: RequestCount
          statistics: [Sum]
        - name: HTTPCode_Target_5XX_Count
          statistics: [Sum]
        - name: HealthyHostCount
          statistics: [Minimum]
        - name: TargetResponseTime
          statistics: [p95]
docker hands-on#
services:
  yace:
    image: ghcr.io/nerdswords/yet-another-cloudwatch-exporter:latest
    container_name: yace
    restart: unless-stopped
    ports:
      - "5000:5000"
    volumes:
      - ./config.yml:/config/config.yml:ro
    environment:
      AWS_REGION: ap-east-1
    command:
      - "--config.file=/config/config.yml"
verify#
curl -s http://127.0.0.1:5000/metrics | grep aws_rds_cpu_utilization_average
8. cAdvisor#
when to use#
use for:
plain Docker host
self-managed container host
local lab
be careful:
cAdvisor can expose many labels
container labels may increase cardinality
in ECS/Fargate, prefer ECS/CloudWatch metrics plus app /metrics first
docker hands-on#
services:
  cadvisor:
    image: gcr.io/cadvisor/cadvisor:latest
    container_name: cadvisor
    restart: unless-stopped
    ports:
      - "8080:8080"
    volumes:
      - /:/rootfs:ro
      - /var/run:/var/run:ro
      - /sys:/sys:ro
      - /var/lib/docker/:/var/lib/docker:ro
      - /dev/disk/:/dev/disk:ro
    privileged: true
    devices:
      - /dev/kmsg
key PromQL#
# container CPU usage seconds rate
sum by (container_label_com_docker_compose_service) (
  rate(container_cpu_usage_seconds_total{image!=""}[5m])
)
# container memory usage bytes
container_memory_working_set_bytes{image!=""}
9. postgres_exporter#
best practices#
use dedicated exporter user
grant pg_monitor
do not use superuser
use sslmode=require for remote database
do not put password in command line
store DSN in EnvironmentFile or secret manager
PostgreSQL user#
CREATE USER postgres_exporter WITH PASSWORD 'change-me';
GRANT pg_monitor TO postgres_exporter;
systemd hands-on#
# /etc/postgres_exporter/postgres_exporter.env
DATA_SOURCE_NAME=postgresql://postgres_exporter:change-me@db.example.com:5432/postgres?sslmode=require

# /etc/systemd/system/postgres_exporter.service
[Unit]
Description=Prometheus PostgreSQL Exporter
After=network-online.target
[Service]
User=postgres_exporter
Group=postgres_exporter
EnvironmentFile=/etc/postgres_exporter/postgres_exporter.env
ExecStart=/usr/local/bin/postgres_exporter --web.listen-address=:9187
Restart=always
RestartSec=5
[Install]
WantedBy=multi-user.target

sudo useradd --system --no-create-home --shell /usr/sbin/nologin postgres_exporter
sudo install -d -m 0750 -o postgres_exporter -g postgres_exporter /etc/postgres_exporter
sudo systemctl daemon-reload
sudo systemctl enable --now postgres_exporter
curl -s http://127.0.0.1:9187/metrics | grep pg_up
key PromQL#
# PostgreSQL exporter cannot connect
pg_up == 0
# active connections by database
pg_stat_database_numbackends
# deadlocks increased in last 5 minutes
increase(pg_stat_database_deadlocks[5m]) > 0
10. redis_exporter For Valkey / Redis#
best practices#
use a read-only or limited ACL user if possible
do not enable key-level metrics in production unless bounded
monitor evictions, memory, connected clients, hit rate, rejected connections
put password in env file, not command line
Valkey / Redis ACL user#
ACL SETUSER redis_exporter on >change-me +client +config|get +info +latency +slowlog +ping
docker hands-on#
services:
  redis_exporter:
    image: oliver006/redis_exporter:latest
    container_name: redis_exporter
    restart: unless-stopped
    ports:
      - "9121:9121"
    environment:
      REDIS_ADDR: "redis://valkey.example.com:6379"
      REDIS_USER: "redis_exporter"
      REDIS_PASSWORD: "change-me"
key PromQL#
# exporter cannot reach Redis / Valkey
redis_up == 0
# memory usage ratio (only meaningful when maxmemory is set; otherwise the divisor is 0)
redis_memory_used_bytes / redis_memory_max_bytes
# evictions happened
increase(redis_evicted_keys_total[5m]) > 0
# cache hit rate (no result while there is no traffic, instead of a distorted ratio)
rate(redis_keyspace_hits_total[5m])
/
(rate(redis_keyspace_hits_total[5m]) + rate(redis_keyspace_misses_total[5m]) > 0)
11. mysqld_exporter#
best practices#
use dedicated exporter user
use config file permissions 0600
enable only needed collectors
monitor connection count, slow queries, InnoDB, replication, aborted connects
MySQL user#
CREATE USER 'mysqld_exporter'@'%' IDENTIFIED BY 'change-me' WITH MAX_USER_CONNECTIONS 3;
GRANT PROCESS, REPLICATION CLIENT, SELECT ON *.* TO 'mysqld_exporter'@'%';
FLUSH PRIVILEGES;
config hands-on#
# /etc/mysqld_exporter/.my.cnf
[client]
user=mysqld_exporter
password=change-me
host=mysql.example.com
port=3306
ssl-mode=REQUIRED

# /etc/systemd/system/mysqld_exporter.service
[Unit]
Description=Prometheus MySQL Exporter
After=network-online.target
[Service]
User=mysqld_exporter
Group=mysqld_exporter
ExecStart=/usr/local/bin/mysqld_exporter \
  --config.my-cnf=/etc/mysqld_exporter/.my.cnf \
  --web.listen-address=:9104
Restart=always
RestartSec=5
[Install]
WantedBy=multi-user.target
key PromQL#
# MySQL exporter cannot connect
mysql_up == 0
# MySQL connections usage
mysql_global_status_threads_connected / mysql_global_variables_max_connections
# slow queries increased
increase(mysql_global_status_slow_queries[5m]) > 0
12. mongodb_exporter#
best practices#
use dedicated MongoDB user
grant clusterMonitor
use TLS when MongoDB requires TLS
avoid exposing URI in process args when possible
monitor replication lag, connections, opcounters, locks, storage
MongoDB user#
use admin
db.createUser({
  user: "mongodb_exporter",
  pwd: "change-me",
  roles: [
    { role: "clusterMonitor", db: "admin" },
    { role: "read", db: "local" }
  ]
})
docker hands-on#
# .env
MONGODB_URI=mongodb://mongodb_exporter:change-me@mongo.example.com:27017/admin?ssl=true

# docker-compose.yml
services:
  mongodb_exporter:
    image: percona/mongodb_exporter:latest
    container_name: mongodb_exporter
    restart: unless-stopped
    ports:
      - "9216:9216"
    command:
      - "--mongodb.uri=${MONGODB_URI}"
key PromQL#
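# MongoDB exporter cannot connect
mongodb_up == 0
# MongoDB connections usage: current / (current + available)
# ignoring(state) is needed because the two series differ in the state label;
# metric names can differ between exporter versions
mongodb_connections{state="current"}
/ ignoring(state)
(mongodb_connections{state="current"} + ignoring(state) mongodb_connections{state="available"})
13. nginx-prometheus-exporter#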
best practices#
enable stub_status for open source NGINX
bind stub_status to localhost or exporter-only network
do not expose /nginx_status publicly
for NGINX Plus, use Plus API instead of stub_status
NGINX stub_status#
server {
  listen 127.0.0.1:8080;
  location /nginx_status {
    stub_status;
    allow 127.0.0.1;
    deny all;
  }
}
docker hands-on#
services:
  nginx_exporter:
    image: nginx/nginx-prometheus-exporter:latest
    container_name: nginx_exporter
    restart: unless-stopped
    network_mode: host
    command:
      # double-dash form; recent exporter releases reject the single-dash long flag
      - "--nginx.scrape-uri=http://127.0.0.1:8080/nginx_status"
key PromQL#
# exporter cannot scrape NGINX
nginx_up == 0
# active NGINX connections
nginx_connections_active
# accepted connections rate
rate(nginx_connections_accepted[5m])
14. jmx_exporter#
when to use#
use for:
Kafka
Cassandra
JVM application
Java middleware exposing JMX MBeans
prefer:
javaagent mode
avoid:
unauthenticated remote JMX exposed on network
Java agent hands-on#
# /opt/jmx_exporter/config.yml
lowercaseOutputName: true
lowercaseOutputLabelNames: true
rules:
  # composite keys of HeapMemoryUsage are lowercase (used, max)
  - pattern: 'java.lang<type=Memory><HeapMemoryUsage>used'
    name: jvm_memory_heap_used_bytes
    type: GAUGE
  - pattern: 'java.lang<type=Memory><HeapMemoryUsage>max'
    name: jvm_memory_heap_max_bytes
    type: GAUGE
  # bean properties are matched as "key=value, key=value" (comma plus space)
  - pattern: 'java.lang<type=GarbageCollector, name=(.+)><>CollectionTime'
    name: jvm_gc_collection_time_milliseconds
    labels:
      gc: "$1"
    type: COUNTER

java \
  -javaagent:/opt/jmx_exporter/jmx_prometheus_javaagent.jar=9404:/opt/jmx_exporter/config.yml \
  -jar app.jar
key PromQL#
# JVM heap usage ratio
jvm_memory_heap_used_bytes / jvm_memory_heap_max_bytes
# GC time rate (milliseconds of GC per second)
rate(jvm_gc_collection_time_milliseconds[5m])
15. Production Checklist#
security:
exporter endpoints only private network
no public /metrics
no secrets in labels
no password in command line
config file permission 0600 for credentials
use TLS when exporter connects to remote DB/cache
cardinality:
avoid user_id / request_id / full_path labels
avoid per-key Redis metrics in production
avoid per-container labels that change often
drop labels that are not used by alert/dashboard
scrape:
node/process/blackbox: 15s-60s
database/cache: 30s-60s
AWS/YACE: 300s by default
scrape_timeout < scrape_interval
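The `scrape_timeout < scrape_interval` rule is easy to check mechanically before a config is rolled out. The sketch below is an assumption-laden illustration: `parse_duration` and `timeout_fits` are hypothetical helpers, and only single-unit durations such as `30s`, `5m`, or `500ms` are handled.

```python
import re

# Seconds per unit for the simple single-unit durations handled here.
_UNIT_SECONDS = {"ms": 0.001, "s": 1, "m": 60, "h": 3600}


def parse_duration(text):
    """Convert a duration such as '30s' or '5m' to seconds."""
    match = re.fullmatch(r"(\d+)(ms|s|m|h)", text)
    if match is None:
        raise ValueError(f"unsupported duration: {text!r}")
    return int(match.group(1)) * _UNIT_SECONDS[match.group(2)]


def timeout_fits(scrape_interval, scrape_timeout):
    """Return True when the timeout is strictly shorter than the interval."""
    return parse_duration(scrape_timeout) < parse_duration(scrape_interval)
```

With the values used for YACE in this document, `timeout_fits("300s", "60s")` is true, while a job with a 30s interval and a 1m timeout would be flagged.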
alerting:
up == 0 for every exporter
alert on target health first
alert on saturation second
alert on trends only when action is clear
operations:
run exporter as non-root when possible
pin versions in production
record dashboard json or panel definitions
keep exporter config in git
document owner and runbook_url per job
16. Quick Smoke Test#
# check local exporter endpoint
curl -s http://127.0.0.1:9100/metrics | head
# check exporter target from Prometheus / vmagent host
curl -s http://monitoring-ec2:9100/metrics | head
# check metric exists in VictoriaMetrics
curl -G "http://victoriametrics:8428/api/v1/query" \
  --data-urlencode 'query=up'

# all scrape targets and their health
up
# number of targets down, by job (count, because summing zeros always yields 0)
count by (job) (up == 0)
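The same check can be scripted. The sketch below assumes a Prometheus-compatible `/api/v1/query` endpoint that returns the standard instant-vector JSON; `query_up` and `down_targets` are hypothetical helper names, and the base URL is whatever host serves the query API in a given setup.

```python
import json
import urllib.parse
import urllib.request


def query_up(base_url):
    """Run the instant query `up` against a Prometheus-compatible API."""
    url = base_url + "/api/v1/query?" + urllib.parse.urlencode({"query": "up"})
    with urllib.request.urlopen(url, timeout=10) as resp:
        return json.load(resp)


def down_targets(response):
    """Return sorted (job, instance) pairs whose `up` sample is 0."""
    down = []
    for series in response["data"]["result"]:
        _timestamp, value = series["value"]  # value arrives as a string
        if float(value) == 0:
            labels = series["metric"]
            down.append((labels.get("job", ""), labels.get("instance", "")))
    return sorted(down)
```

A call such as `down_targets(query_up("http://victoriametrics:8428"))` would list the targets to investigate first; an empty list means every scraped target reported `up == 1` at query time.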