https://prometheus.io/docs/instrumenting/exporters/
https://github.com/prometheus/node_exporter
https://github.com/ncabatoff/process-exporter
https://github.com/prometheus/blackbox_exporter
https://github.com/prometheus-community/yet-another-cloudwatch-exporter
https://github.com/prometheus-community/postgres_exporter
https://github.com/oliver006/redis_exporter
https://github.com/prometheus/mysqld_exporter
https://github.com/percona/mongodb_exporter
https://github.com/nginx/nginx-prometheus-exporter
https://github.com/prometheus/jmx_exporter
https://prometheus.io/docs/guides/cadvisor/

1. Important Points

An exporter is a metrics adapter:
    read from OS / app / database / cloud API
    expose /metrics in Prometheus format
    Prometheus / vmagent / VictoriaMetrics scrape it

Core principles:
    exporter should run close to target
    exporter endpoint should not be public internet
    exporter credentials must be least privilege
    do not export high-cardinality labels
    do not enable all collectors blindly
    scrape interval usually starts from 30s or 60s
    AWS CloudWatch exporter usually starts from 300s to reduce cost
    every exporter must have an up{} alert

Do not treat an exporter as a logging system:

metrics:
    counters
    gauges
    histograms
    low cardinality labels

not metrics:
    raw logs
    request body
    user id
    token
    SQL text
    full URL with query string
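
The cardinality rule can be checked against a live exporter by counting series per metric name. A minimal sketch, assuming Prometheus exposition text arrives on stdin; the function name series_per_metric is illustrative and not part of any exporter:

```shell
# series_per_metric: read Prometheus exposition text on stdin and print
# "<series count> <metric name>", highest count first.
series_per_metric() {
  awk '
    /^#/ || NF == 0 { next }          # skip HELP/TYPE comments and blank lines
    {
      name = $0
      sub(/[{ ].*/, "", name)         # strip labels and value, keep the metric name
      count[name]++
    }
    END { for (n in count) print count[n], n }
  ' | sort -k1,1nr -k2,2
}
```

Usage: curl -s http://127.0.0.1:9100/metrics | series_per_metric | head. A metric with thousands of series is usually a sign that a label carries something like a user id or a full URL.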

2. Exporter Selection

| Scope | Exporter | Port | Use For | Notes |
|---|---|---|---|---|
| Linux VM / EC2 | node_exporter | 9100 | CPU, memory, disk, network, filesystem | every VM should have it |
| Process | process-exporter | 9256 | named process count, CPU, memory | useful for Grafana / VictoriaMetrics / Alertmanager process check |
| HTTP / TCP / DNS / TLS | blackbox_exporter | 9115 | external user-view availability | use for ALB, CloudFront, API health, TLS expiry |
| AWS CloudWatch | YACE | 5000 | RDS, ECS, ALB, SQS, DynamoDB, CloudFront metrics | cost depends on CloudWatch API calls |
| Container on VM | cAdvisor | 8080 | Docker/container CPU, memory, network | useful on plain Docker host, not always needed on ECS/Fargate |
| PostgreSQL | postgres_exporter | 9187 | connections, locks, transactions, replication, table stats | use read-only user with pg_monitor |
| Valkey / Redis | redis_exporter | 9121 | memory, connected clients, evictions, hit rate | avoid expensive key scanning in prod |
| MySQL / MariaDB | mysqld_exporter | 9104 | connections, QPS, InnoDB, replication | use dedicated low-privilege user |
| MongoDB | mongodb_exporter | 9216 | replication, sharding, storage, op counters | use clusterMonitor role |
| NGINX | nginx-prometheus-exporter | 9113 | active connections, requests, NGINX Plus metrics | open stub_status only to localhost/exporter |
| JVM / Kafka / Cassandra | jmx_exporter | custom | JMX MBeans to Prometheus | prefer javaagent mode |
| Kubernetes object state | kube-state-metrics | 8080 | pod/deployment/node object status | for Kubernetes, not ECS |
| Windows VM | windows_exporter | 9182 | Windows CPU, memory, disk, services | only if Windows hosts exist |

Minimal production set:

EC2 monitoring host:
    node_exporter
    process-exporter
    blackbox_exporter

AWS managed services:
    YACE
    blackbox_exporter for user-view probes

self-managed database/cache:
    postgres_exporter / redis_exporter / mysqld_exporter / mongodb_exporter

Java middleware:
    jmx_exporter

3. Common Scrape Config

Example vmagent / Prometheus scrape config:

global:
  scrape_interval: 30s
  scrape_timeout: 10s

scrape_configs:
  - job_name: node
    static_configs:
      - targets:
          - monitoring-ec2:9100
          - app-ec2-1:9100

  - job_name: process
    static_configs:
      - targets:
          - monitoring-ec2:9256

  - job_name: yace
    scrape_interval: 300s
    scrape_timeout: 60s
    static_configs:
      - targets:
          - monitoring-ec2:5000

  - job_name: blackbox-http
    metrics_path: /probe
    params:
      module: [http_2xx]
    static_configs:
      - targets:
          - https://www.example.com/health
          - https://api.example.com/health
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: monitoring-ec2:9115

Every exporter should have a baseline alert:

groups:
  - name: exporter.rules
    rules:
      - alert: ExporterDown
        expr: up == 0
        for: 3m
        labels:
          severity: P1
          component: exporter
        annotations:
          summary: "Exporter target is down: {{ $labels.job }} {{ $labels.instance }}"

4. node_exporter

when to use

use for:
    EC2
    Linux VM
    self-managed monitoring host
    database host
    cache host

do not use for:
    managed AWS service itself
    Fargate task OS metrics

best practices

bind to private network or localhost with reverse proxy
exclude container overlay filesystems
exclude tmpfs/dev/proc/sys mountpoints
do not enable expensive collectors unless needed
use textfile collector for custom host-level metrics
alert on disk, memory, CPU, instance down
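
The textfile collector reads *.prom files from the directory passed via --collector.textfile.directory. A minimal sketch of writing one gauge atomically, so node_exporter never reads a half-written file; the function name, the metric name, and the directory in the usage line are illustrative assumptions:

```shell
# write_textfile_metric DIR NAME VALUE [HELP]
# Writes DIR/NAME.prom via a temp file plus rename. The temp file has no
# .prom suffix, so the collector ignores it until the rename completes.
write_textfile_metric() (
  dir=$1 name=$2 value=$3 help=${4:-"custom host metric"}
  tmp=$(mktemp "$dir/.$name.XXXXXX") || exit 1
  {
    printf '# HELP %s %s\n' "$name" "$help"
    printf '# TYPE %s gauge\n' "$name"
    printf '%s %s\n' "$name" "$value"
  } > "$tmp"
  chmod 0644 "$tmp"                 # mktemp creates 0600; the exporter user must read it
  mv "$tmp" "$dir/$name.prom"       # rename is atomic within one filesystem
)
```

Usage, for example from a cron job after a backup: write_textfile_metric /var/lib/node_exporter/textfile backup_last_success_timestamp_seconds "$(date +%s)".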

systemd hands-on

# /etc/systemd/system/node_exporter.service
[Unit]
Description=Prometheus Node Exporter
After=network-online.target

[Service]
User=node_exporter
Group=node_exporter
Type=simple
ExecStart=/usr/local/bin/node_exporter \
  --web.listen-address=:9100 \
  --collector.filesystem.mount-points-exclude=^/(dev|proc|sys|run/credentials/.+|var/lib/docker/.+|var/lib/containers/storage/.+|var/lib/kubelet/.+)($$|/) \
  --collector.filesystem.fs-types-exclude=^(autofs|binfmt_misc|bpf|cgroup2?|configfs|debugfs|devpts|devtmpfs|fusectl|hugetlbfs|mqueue|nsfs|overlay|proc|procfs|pstore|rpc_pipefs|securityfs|selinuxfs|squashfs|sysfs|tracefs)$$
Restart=always
RestartSec=5

[Install]
WantedBy=multi-user.target

# create the service user, then reload systemd, start the unit, and verify
sudo useradd --system --no-create-home --shell /usr/sbin/nologin node_exporter
sudo systemctl daemon-reload
sudo systemctl enable --now node_exporter
curl -s http://127.0.0.1:9100/metrics | head

key PromQL

# host down
up{job="node"} == 0
# disk usage percent
100 * (1 - node_filesystem_avail_bytes{fstype!~"tmpfs|overlay"} / node_filesystem_size_bytes{fstype!~"tmpfs|overlay"})
# memory usage percent
100 * (1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)
# CPU usage percent
100 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100
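
The memory expression can be cross-checked on the host itself. A minimal sketch that applies the same formula, 100 * (1 - MemAvailable / MemTotal), to /proc/meminfo-style input; the function name is illustrative:

```shell
# mem_used_percent: read /proc/meminfo-style text on stdin and print
# the used-memory percentage with one decimal place.
mem_used_percent() {
  awk '
    $1 == "MemTotal:"     { total = $2 }
    $1 == "MemAvailable:" { avail = $2 }
    END {
      if (total == 0) exit 1          # input did not contain MemTotal
      printf "%.1f\n", 100 * (1 - avail / total)
    }
  '
}
```

Usage: mem_used_percent < /proc/meminfo. The result should be close to the PromQL value for the same instance, within one scrape interval of drift.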

5. process-exporter

when to use

use for:
    verify Grafana / Alertmanager / VictoriaMetrics process exists
    monitor legacy VM app process
    track process CPU / memory by logical name

do not use for:
    every short-lived process
    per-user process inventory

config hands-on

# /etc/process-exporter/config.yml
process_names:
  - name: "victoriametrics"
    exe:
      - /usr/local/bin/victoria-metrics

  - name: "alertmanager"
    exe:
      - /usr/local/bin/alertmanager

  - name: "grafana"
    cmdline:
      - "grafana-server"

  - name: "spug-gunicorn"
    exe:
      - /usr/bin/python3
    cmdline:
      - "gunicorn"
      - "spug.wsgi"

Choosing between comm / exe / cmdline:

ps -eo pid,comm,args

comm:
    from /proc/<pid>/comm
    short process name

exe:
    executable path
    stable when binary path is fixed

cmdline:
    command line args
    useful for Python / Java / Node.js apps
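
To decide which matcher fits a given process, it helps to look at all three values for one PID. A minimal sketch for Linux, reading straight from /proc; the function name is illustrative:

```shell
# proc_identity PID: print the three values process-exporter can match on.
proc_identity() {
  pid=$1
  printf 'comm:    %s\n' "$(cat "/proc/$pid/comm")"
  printf 'exe:     %s\n' "$(readlink "/proc/$pid/exe")"
  # cmdline is NUL-separated; turn the separators into spaces for display
  printf 'cmdline: %s\n' "$(tr '\0' ' ' < "/proc/$pid/cmdline")"
}
```

Usage: proc_identity "$(pgrep -o grafana)". Reading exe for a process owned by another user may require root. For interpreters such as python3 or java, exe is the same for every app on the host, so cmdline is usually the only value that tells them apart.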

systemd hands-on

# /etc/systemd/system/process-exporter.service
[Unit]
Description=Prometheus Process Exporter
After=network-online.target

[Service]
User=process_exporter
Group=process_exporter
ExecStart=/usr/local/bin/process-exporter \
  --web.listen-address=:9256 \
  --config.path=/etc/process-exporter/config.yml
Restart=always
RestartSec=5

[Install]
WantedBy=multi-user.target

# create the service user and config directory, then start the unit and verify
sudo useradd --system --no-create-home --shell /usr/sbin/nologin process_exporter
sudo install -d -m 0755 /etc/process-exporter
sudo systemctl daemon-reload
sudo systemctl enable --now process-exporter
curl -s http://127.0.0.1:9256/metrics | grep namedprocess_namegroup_num_procs

key PromQL

# named process missing
namedprocess_namegroup_num_procs{groupname="victoriametrics"} == 0
# process memory usage
namedprocess_namegroup_memory_bytes{groupname="victoriametrics",memtype="resident"}

6. blackbox_exporter

when to use

use for:
    public website
    API health endpoint
    ALB DNS
    CloudFront domain
    TCP port
    DNS resolution
    TLS certificate expiry

blackbox_exporter answers:
    can user reach this endpoint?
    is DNS working?
    is TLS valid?
    is response status expected?

config hands-on

# /etc/blackbox_exporter/blackbox.yml
modules:
  http_2xx:
    prober: http
    timeout: 5s
    http:
      method: GET
      preferred_ip_protocol: ip4
      valid_http_versions:
        - HTTP/1.1
        - HTTP/2.0
      valid_status_codes:
        - 200
        - 204
      fail_if_ssl: false
      fail_if_not_ssl: true

  tcp_connect:
    prober: tcp
    timeout: 5s

  dns_lookup:
    prober: dns
    timeout: 5s
    dns:
      query_name: example.com
      query_type: A

# docker-compose.yml
services:
  blackbox_exporter:
    image: quay.io/prometheus/blackbox-exporter:latest
    container_name: blackbox_exporter
    restart: unless-stopped
    ports:
      - "9115:9115"
    volumes:
      - ./blackbox.yml:/config/blackbox.yml:ro
    command:
      - "--config.file=/config/blackbox.yml"

scrape config

scrape_configs:
  - job_name: blackbox-http
    metrics_path: /probe
    params:
      module: [http_2xx]
    static_configs:
      - targets:
          - https://www.example.com/health
          - https://api.example.com/health
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: monitoring-ec2:9115
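
The three relabel rules are easier to follow when traced for a single target. A minimal sketch that mimics them with shell variables; the function name is illustrative and the module is fixed to http_2xx as in the job above:

```shell
# blackbox_relabel TARGET EXPORTER: trace the relabel rules for one target.
# Prometheus URL-encodes the target parameter; it is shown raw here for readability.
blackbox_relabel() (
  address=$1                 # initial __address__ comes from static_configs
  exporter=$2
  param_target=$address      # rule 1: copy __address__ into __param_target
  instance=$param_target     # rule 2: copy __param_target into the instance label
  address=$exporter          # rule 3: overwrite __address__ with the exporter address
  printf 'scrape_url=http://%s/probe?module=http_2xx&target=%s\n' "$address" "$param_target"
  printf 'instance=%s\n' "$instance"
)
```

Usage: blackbox_relabel https://www.example.com/health monitoring-ec2:9115. The scrape goes to the exporter, while the instance label keeps the probed URL, which is what dashboards and alerts should show.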

key PromQL

# endpoint unavailable
probe_success{job="blackbox-http"} == 0
# HTTP probe p95 latency over the last 15 minutes is above 2s
quantile_over_time(0.95, probe_duration_seconds{job="blackbox-http"}[15m]) > 2
# TLS certificate expires in less than 14 days
(probe_ssl_earliest_cert_expiry - time()) / 86400 < 14

7. YACE

when to use

use for:
    AWS/RDS
    AWS/ECS
    AWS/ApplicationELB
    AWS/SQS
    AWS/DynamoDB
    AWS/ElastiCache
    AWS/CloudFront

do not use for:
    application custom metrics
    per-request high-cardinality data

YACE cost notes:

CloudWatch API calls cost money
period should usually start at 300s
length should usually be 600s or 900s
only scrape metrics used by alerts or dashboards
tag discovery is useful but must keep labels controlled
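
A rough cost estimate makes the 300s recommendation concrete. A minimal sketch under loud assumptions: one GetMetricData request per metric per poll, a 30-day month, and a price of 0.01 USD per 1,000 metrics requested. The price is an assumption to verify against current CloudWatch pricing, and the function name is illustrative:

```shell
# yace_monthly_cost METRICS PERIOD_SECONDS [PRICE_PER_1000]
# Prints the estimated monthly CloudWatch API cost in USD.
yace_monthly_cost() {
  awk -v m="$1" -v p="$2" -v price="${3:-0.01}" 'BEGIN {
    polls = 30 * 86400 / p          # polls in a 30-day month
    requested = m * polls           # metrics requested in total
    printf "%.2f\n", requested / 1000 * price
  }'
}
```

Usage: yace_monthly_cost 200 300 prints 17.28, while yace_monthly_cost 200 60 prints 86.40. Polling five times as often costs five times as much, which is why only metrics used by alerts or dashboards should be configured.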

IAM policy minimal sample

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "cloudwatch:GetMetricData",
        "cloudwatch:GetMetricStatistics",
        "cloudwatch:ListMetrics",
        "tag:GetResources"
      ],
      "Resource": "*"
    }
  ]
}

config hands-on

# /etc/yace/config.yml
apiVersion: v1alpha1
sts-region: ap-east-1
discovery:
  jobs:
    - type: AWS/RDS
      regions:
        - ap-east-1
      period: 300
      length: 600
      nilToZero: true
      metrics:
        - name: CPUUtilization
          statistics: [Average]
        - name: FreeableMemory
          statistics: [Average]
        - name: DatabaseConnections
          statistics: [Average]
        - name: ReadLatency
          statistics: [Average]
        - name: WriteLatency
          statistics: [Average]

    - type: AWS/SQS
      regions:
        - ap-east-1
      period: 300
      length: 600
      nilToZero: true
      metrics:
        - name: ApproximateAgeOfOldestMessage
          statistics: [Maximum]
        - name: ApproximateNumberOfMessagesVisible
          statistics: [Average]
        - name: NumberOfMessagesDeleted
          statistics: [Sum]

    - type: AWS/ApplicationELB
      regions:
        - ap-east-1
      period: 300
      length: 600
      nilToZero: true
      metrics:
        - name: RequestCount
          statistics: [Sum]
        - name: HTTPCode_Target_5XX_Count
          statistics: [Sum]
        - name: HealthyHostCount
          statistics: [Minimum]
        - name: TargetResponseTime
          statistics: [p95]

docker hands-on

services:
  yace:
    image: ghcr.io/nerdswords/yet-another-cloudwatch-exporter:latest
    container_name: yace
    restart: unless-stopped
    ports:
      - "5000:5000"
    volumes:
      - ./config.yml:/config/config.yml:ro
    environment:
      AWS_REGION: ap-east-1
    command:
      - "--config.file=/config/config.yml"

verify

# list the RDS series YACE exposes (series names are derived from the CloudWatch metric names)
curl -s http://127.0.0.1:5000/metrics | grep '^aws_rds_' | head

8. cAdvisor

when to use

use for:
    plain Docker host
    self-managed container host
    local lab

be careful:
    cAdvisor can expose many labels
    container labels may increase cardinality
    in ECS/Fargate, prefer ECS/CloudWatch metrics plus app /metrics first

docker hands-on

services:
  cadvisor:
    image: gcr.io/cadvisor/cadvisor:latest
    container_name: cadvisor
    restart: unless-stopped
    ports:
      - "8080:8080"
    volumes:
      - /:/rootfs:ro
      - /var/run:/var/run:ro
      - /sys:/sys:ro
      - /var/lib/docker/:/var/lib/docker:ro
      - /dev/disk/:/dev/disk:ro
    privileged: true
    devices:
      - /dev/kmsg

key PromQL

# container CPU usage seconds rate
sum by (container_label_com_docker_compose_service) (
  rate(container_cpu_usage_seconds_total{image!=""}[5m])
)
# container memory usage bytes
container_memory_working_set_bytes{image!=""}

9. postgres_exporter

best practices

use dedicated exporter user
grant pg_monitor
do not use superuser
use sslmode=require for remote database
do not put password in command line
store DSN in EnvironmentFile or secret manager

PostgreSQL user

CREATE USER postgres_exporter WITH PASSWORD 'change-me';
GRANT pg_monitor TO postgres_exporter;

systemd hands-on

# /etc/postgres_exporter/postgres_exporter.env
DATA_SOURCE_NAME=postgresql://postgres_exporter:change-me@db.example.com:5432/postgres?sslmode=require

# /etc/systemd/system/postgres_exporter.service
[Unit]
Description=Prometheus PostgreSQL Exporter
After=network-online.target

[Service]
User=postgres_exporter
Group=postgres_exporter
EnvironmentFile=/etc/postgres_exporter/postgres_exporter.env
ExecStart=/usr/local/bin/postgres_exporter --web.listen-address=:9187
Restart=always
RestartSec=5

[Install]
WantedBy=multi-user.target

# create the service user and config directory, then start the unit and verify
sudo useradd --system --no-create-home --shell /usr/sbin/nologin postgres_exporter
sudo install -d -m 0750 -o postgres_exporter -g postgres_exporter /etc/postgres_exporter
sudo systemctl daemon-reload
sudo systemctl enable --now postgres_exporter
curl -s http://127.0.0.1:9187/metrics | grep pg_up

key PromQL

# PostgreSQL exporter cannot connect
pg_up == 0
# active connections by database
pg_stat_database_numbackends
# deadlocks increased in last 5 minutes
increase(pg_stat_database_deadlocks[5m]) > 0

10. redis_exporter For Valkey / Redis

best practices

use a read-only or limited ACL user if possible
do not enable key-level metrics in production unless bounded
monitor evictions, memory, connected clients, hit rate, rejected connections
put password in env file, not command line

Valkey / Redis ACL user

ACL SETUSER redis_exporter on >change-me +client +config|get +info +latency +slowlog +ping

docker hands-on

services:
  redis_exporter:
    image: oliver006/redis_exporter:latest
    container_name: redis_exporter
    restart: unless-stopped
    ports:
      - "9121:9121"
    environment:
      REDIS_ADDR: "redis://valkey.example.com:6379"
      REDIS_USER: "redis_exporter"
      REDIS_PASSWORD: "change-me"

key PromQL

# exporter cannot reach Redis / Valkey
redis_up == 0
# memory usage ratio
redis_memory_used_bytes / redis_memory_max_bytes
# evictions happened
increase(redis_evicted_keys_total[5m]) > 0
# cache hit ratio (returns no sample while there are no lookups)
rate(redis_keyspace_hits_total[5m])
/
(rate(redis_keyspace_hits_total[5m]) + rate(redis_keyspace_misses_total[5m]) > 0)
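
The hit ratio is hits divided by hits plus misses. A minimal sketch that computes the same ratio from the keyspace_hits and keyspace_misses counters in INFO stats output, useful as a sanity check on the server itself; the function name is illustrative:

```shell
# redis_hit_ratio: read `INFO stats` text on stdin, print hits / (hits + misses).
# INFO lines end in CRLF, so carriage returns are stripped first.
redis_hit_ratio() {
  tr -d '\r' |
    awk -F: '
      $1 == "keyspace_hits"   { hits = $2 }
      $1 == "keyspace_misses" { misses = $2 }
      END {
        total = hits + misses
        if (total == 0) { print "n/a"; exit }   # no lookups yet, ratio undefined
        printf "%.3f\n", hits / total
      }
    '
}
```

Usage: redis-cli -h valkey.example.com INFO stats | redis_hit_ratio. These counters are cumulative since server start, so the value is a lifetime average and will differ from the 5-minute PromQL ratio.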

11. mysqld_exporter

best practices

use dedicated exporter user
use config file permissions 0600
enable only needed collectors
monitor connection count, slow queries, InnoDB, replication, aborted connects

MySQL user

CREATE USER 'mysqld_exporter'@'%' IDENTIFIED BY 'change-me' WITH MAX_USER_CONNECTIONS 3;
GRANT PROCESS, REPLICATION CLIENT, SELECT ON *.* TO 'mysqld_exporter'@'%';
FLUSH PRIVILEGES;

config hands-on

# /etc/mysqld_exporter/.my.cnf
[client]
user=mysqld_exporter
password=change-me
host=mysql.example.com
port=3306
ssl-mode=REQUIRED

# /etc/systemd/system/mysqld_exporter.service
[Unit]
Description=Prometheus MySQL Exporter
After=network-online.target

[Service]
User=mysqld_exporter
Group=mysqld_exporter
ExecStart=/usr/local/bin/mysqld_exporter \
  --config.my-cnf=/etc/mysqld_exporter/.my.cnf \
  --web.listen-address=:9104
Restart=always
RestartSec=5

[Install]
WantedBy=multi-user.target

key PromQL

# MySQL exporter cannot connect
mysql_up == 0
# MySQL connections usage
mysql_global_status_threads_connected / mysql_global_variables_max_connections
# slow queries increased
increase(mysql_global_status_slow_queries[5m]) > 0

12. mongodb_exporter

best practices

use dedicated MongoDB user
grant clusterMonitor
use TLS when MongoDB requires TLS
avoid exposing URI in process args when possible
monitor replication lag, connections, opcounters, locks, storage

MongoDB user

use admin
db.createUser({
  user: "mongodb_exporter",
  pwd: "change-me",
  roles: [
    { role: "clusterMonitor", db: "admin" },
    { role: "read", db: "local" }
  ]
})

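docker hands-on

# .env (keep out of git, readable only by the deploy user)
MONGODB_URI=mongodb://mongodb_exporter:change-me@mongo.example.com:27017/admin?ssl=true

# docker-compose.yml
services:
  mongodb_exporter:
    image: percona/mongodb_exporter:latest
    container_name: mongodb_exporter
    restart: unless-stopped
    ports:
      - "9216:9216"
    environment:
      # passed as an environment variable so the URI does not show up in process args
      MONGODB_URI: ${MONGODB_URI}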

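key PromQL

# MongoDB exporter cannot connect
mongodb_up == 0
# MongoDB connections usage: current / (current + available)
# ignoring(state) is required because the two series differ in the state label
mongodb_connections{state="current"}
/ ignoring(state)
(mongodb_connections{state="current"} + ignoring(state) mongodb_connections{state="available"})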

13. nginx-prometheus-exporter

best practices

enable stub_status for open source NGINX
bind stub_status to localhost or exporter-only network
do not expose /nginx_status publicly
for NGINX Plus, use Plus API instead of stub_status

NGINX stub_status

server {
    listen 127.0.0.1:8080;

    location /nginx_status {
        stub_status;
        allow 127.0.0.1;
        deny all;
    }
}

docker hands-on

services:
  nginx_exporter:
    image: nginx/nginx-prometheus-exporter:latest
    container_name: nginx_exporter
    restart: unless-stopped
    network_mode: host
    command:
      - "--nginx.scrape-uri=http://127.0.0.1:8080/nginx_status"

key PromQL

# exporter cannot scrape NGINX
nginx_up == 0
# active NGINX connections
nginx_connections_active
# accepted connections rate
rate(nginx_connections_accepted[5m])

14. jmx_exporter

when to use

use for:
    Kafka
    Cassandra
    JVM application
    Java middleware exposing JMX MBeans

prefer:
    javaagent mode

avoid:
    unauthenticated remote JMX exposed on network

Java agent hands-on

# /opt/jmx_exporter/config.yml
lowercaseOutputName: true
lowercaseOutputLabelNames: true
rules:
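  # keys of the HeapMemoryUsage composite are lower-case (used, max, committed, init)
  - pattern: 'java.lang<type=Memory><HeapMemoryUsage>used'
    name: jvm_memory_heap_used_bytes
    type: GAUGE

  - pattern: 'java.lang<type=Memory><HeapMemoryUsage>max'
    name: jvm_memory_heap_max_bytes
    type: GAUGE

  # bean properties are joined with ", " in the string the pattern is matched against
  - pattern: 'java.lang<type=GarbageCollector, name=(.+)><>CollectionTime'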
    name: jvm_gc_collection_time_milliseconds
    labels:
      gc: "$1"
    type: COUNTER

# start the application with the agent serving metrics on port 9404
java \
  -javaagent:/opt/jmx_exporter/jmx_prometheus_javaagent.jar=9404:/opt/jmx_exporter/config.yml \
  -jar app.jar

key PromQL

# JVM heap usage ratio
jvm_memory_heap_used_bytes / jvm_memory_heap_max_bytes
# GC time increased
rate(jvm_gc_collection_time_milliseconds[5m])

15. Production Checklist

security:
    exporter endpoints only private network
    no public /metrics
    no secrets in labels
    no password in command line
    config file permission 0600 for credentials
    use TLS when exporter connects to remote DB/cache
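
The permission item is easy to verify in a deploy script. A minimal sketch that succeeds only when a credentials file is a regular file with mode 0600; the function name is illustrative:

```shell
# is_private_file FILE: exit 0 only if FILE is a regular file with mode 0600.
# The mode is read from the first ten characters of `ls -ld`, which avoids
# the differing stat(1) flags on GNU and BSD systems.
is_private_file() {
  [ -f "$1" ] && [ "$(ls -ld "$1" | cut -c1-10)" = "-rw-------" ]
}
```

Usage: is_private_file /etc/mysqld_exporter/.my.cnf || echo "fix permissions". The file must also be owned by the user the exporter runs as, otherwise the exporter cannot read it.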

cardinality:
    avoid user_id / request_id / full_path labels
    avoid per-key Redis metrics in production
    avoid per-container labels that change often
    drop labels that are not used by alert/dashboard

scrape:
    node/process/blackbox: 15s-60s
    database/cache: 30s-60s
    AWS/YACE: 300s by default
    scrape_timeout < scrape_interval

alerting:
    up == 0 for every exporter
    alert on target health first
    alert on saturation second
    alert on trends only when action is clear

operations:
    run exporter as non-root when possible
    pin versions in production
    record dashboard json or panel definitions
    keep exporter config in git
    document owner and runbook_url per job

16. Quick Smoke Test

# check local exporter endpoint
curl -s http://127.0.0.1:9100/metrics | head

# check exporter target from Prometheus / vmagent host
curl -s http://monitoring-ec2:9100/metrics | head

# check metric exists in VictoriaMetrics
curl -G "http://victoriametrics:8428/api/v1/query" \
  --data-urlencode 'query=up'

# PromQL: all scrape targets and their health
up
# PromQL: number of down targets per job
count by (job) (up == 0)