Alerting


https://prometheus.io/docs/practices/alerting/
https://prometheus.io/docs/prometheus/latest/configuration/alerting_rules/
https://prometheus.io/docs/alerting/latest/alertmanager/
https://docs.victoriametrics.com/vmalert/
https://grafana.com/docs/grafana/latest/alerting/
https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/AlarmThatSendsEmail.html

1. Important Points

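Alerting is not about pushing every anomaly into a chat channel. It is about sending the things a human needs to handle, in a stable, routable, and debuggable format.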

good alert:
    actionable
    has owner
    has severity
    has resource identity
    has dashboard / runbook
    has enough labels for routing
    fires only when human action is useful

bad alert:
    no owner
    no threshold reason
    no recovery notification
    no runbook
    noisy low-value warning
    only says "something is wrong"

A unified standard should cover every alerting system:

CloudWatch Alarm
Prometheus alerting rules
vmalert rules
Grafana Alerting
synthetic checks
log-based alerts

2. Severity

Severity  Meaning                                               Notify
P0        user-facing outage / data loss / security emergency   page immediately
P1        production degraded or important capability broken    on-call / incident channel
P2        risk or partial issue, should be handled soon         team channel / work queue
P3        low priority, trend, cleanup, capacity planning       ticket / report

P0:
    service unavailable
    no healthy target
    critical data path down
    high-severity security incident

P1:
    high 5xx rate
    task below desired count
    database replica unavailable / failover issue
    queue age violates SLA

P2:
    high CPU/memory for sustained time
    unhealthy target exists but capacity remains
    certificate expiring soon
    scaling or quota risk

P3:
    cost anomaly follow-up
    noisy dependency
    non-production issue

3. Alert Naming

Recommended format:

<severity>-<env>-<service>-<resource_type>-<resource>-<signal>

Examples:

P0-prod-alb-targetgroup-api-HealthyHostCount
P1-prod-alb-targetgroup-api-Target5xxRate
P1-prod-ecs-service-api-TaskBelowDesired
P2-prod-ecs-service-api-CPUUtilizationHigh
P1-prod-aurora-cluster-main-CPUUtilization
P2-uat-sqs-queue-worker-OldestMessageAge

Why severity first:

console sorting:
    P0/P1 are easier to scan

routing:
    prefix matching is simple

incident review:
    export / search / dashboard grouping is easier

Rules:

use:
    stable env name: prod / uat / dev
    stable service name: api / worker / payment
    stable resource type: alb / targetgroup / ecs-service / queue / cluster
    meaningful signal name: Target5xxRate / HealthyHostCount / OldestMessageAge

avoid:
    random resource id only
    human name only
    severity hidden at the end
    long sentence as alarm name
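
The naming rules above can be checked mechanically. The following is a minimal sketch, not an official tool: the function names and the regular expression are assumptions made for illustration. Resource types such as ecs-service contain hyphens themselves, so the sketch only treats severity, env, and the trailing signal as unambiguous and does not try to split the middle segments.

```python
import re

# Assumed convention: P0-P3, a known env, at least three lowercase
# hyphen-separated segments (service, resource_type, resource),
# then a CamelCase signal name.
NAME_RE = re.compile(
    r"^(?P<severity>P[0-3])-(?P<env>prod|uat|dev)-"
    r"(?P<middle>[a-z0-9]+(?:-[a-z0-9]+){2,})-"
    r"(?P<signal>[A-Z][A-Za-z0-9]*)$"
)


def build_alert_name(severity, env, service, resource_type, resource, signal):
    """Join the six parts and reject names that break the convention."""
    name = "-".join([severity, env, service, resource_type, resource, signal])
    if NAME_RE.match(name) is None:
        raise ValueError(f"alert name does not follow the convention: {name!r}")
    return name


def parse_prefix(name):
    """Return (severity, env, signal), or None if the name is non-conforming."""
    match = NAME_RE.match(name)
    if match is None:
        return None
    return match.group("severity"), match.group("env"), match.group("signal")
```

Because severity and env come first, a router can read them from the name alone, for example parse_prefix("P0-prod-alb-targetgroup-api-HealthyHostCount"), without knowing which system produced the alert.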

4. Standard Labels

Every alert should carry these labels or fields.

severity:
    P0 / P1 / P2 / P3

env:
    prod / uat / dev

service:
    api / worker / payment / platform

team:
    owning team or on-call route

resource_type:
    alb / targetgroup / ecs-service / sqs-queue / rds-cluster

resource:
    stable resource name or id

region:
    cloud region when applicable

account:
    cloud account id or tenant/project id

Prometheus/vmalert labels:

labels:
  severity: P1
  env: prod
  service: api
  team: platform
  resource_type: targetgroup
  resource: prod-api

For CloudWatch, the Lambda notification router should normalize the alarm payload into the same fields.
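
A label check like the one below could run in CI against rule files or inside the router. It is a sketch under assumptions: the function name is hypothetical, and treating region and account as optional is a choice made here because the text lists region as "when applicable" and the Prometheus example omits both.

```python
# Labels every alert is expected to carry, per the list above.
REQUIRED_LABELS = ("severity", "env", "service", "team", "resource_type", "resource")

# Assumption: only severity and env have a closed set of values.
ALLOWED_VALUES = {
    "severity": {"P0", "P1", "P2", "P3"},
    "env": {"prod", "uat", "dev"},
}


def label_problems(labels):
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    for key in REQUIRED_LABELS:
        if not labels.get(key):
            problems.append(f"missing label: {key}")
    for key, allowed in ALLOWED_VALUES.items():
        value = labels.get(key)
        if value and value not in allowed:
            problems.append(f"invalid {key}: {value}")
    return problems
```

Returning a list of problems instead of raising on the first one lets a review job report every gap in a rule file in a single pass.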

5. Notification Routing

General routing path:

alert source
  -> routing layer
  -> notification channel
  -> incident / ticket / automation

AWS path:

CloudWatch Alarm
  -> SNS Topic
  -> Lambda notification router
  -> Slack / Teams / PagerDuty / Opsgenie / webhook

Prometheus / vmalert path:

Prometheus / vmalert
  -> Alertmanager
  -> receiver
  -> Slack / Teams / PagerDuty / webhook

Grafana path:

Grafana Alerting
  -> notification policy
  -> contact point
  -> Slack / Teams / PagerDuty / webhook

Routing rule:

P0/P1 prod:
    on-call page + incident channel

P2 prod:
    team channel + ticket

P3:
    ticket / report only

dev/uat:
    team channel only, unless explicitly critical
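
The routing rule above is small enough to express as one function. This is a sketch only: the target names are placeholders, the critical flag is a hypothetical way to mark a non-prod alert as "explicitly critical", and sending such alerts through the prod rules is an assumption the text does not spell out.

```python
def route(severity, env, critical=False):
    """Return the notification targets for an alert (placeholder names)."""
    if severity not in ("P0", "P1", "P2", "P3"):
        raise ValueError(f"unknown severity: {severity!r}")
    # dev/uat: team channel only, unless explicitly marked critical.
    if env != "prod" and not critical:
        return ["team-channel"]
    if severity in ("P0", "P1"):
        return ["on-call-page", "incident-channel"]
    if severity == "P2":
        return ["team-channel", "ticket"]
    # P3: ticket / report only.
    return ["ticket"]
```

The same decision table can be mirrored in Alertmanager routes or Grafana notification policies by matching on the severity and env labels, which keeps all three paths consistent.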

6. Standard Payload

Normalized notification payload:

source: cloudwatch / prometheus / vmalert / grafana
severity: P1
env: prod
service: api
team: platform
alert_name: P1-prod-alb-targetgroup-api-Target5xxRate
state: firing / resolved / ALARM / OK / INSUFFICIENT_DATA
previous_state: OK / ALARM / INSUFFICIENT_DATA
account: 111122223333
region: ap-east-1
namespace: AWS/ApplicationELB
resource_type: targetgroup
resource: targetgroup/prod-api/def456
signal: Target5xxRate
summary: short human-readable message
reason: alarm reason or expression value
started_at: 2026-06-04T10:20:30Z
dashboard_url: https://...
runbook_url: https://...

Why these fields matter:

account / region:
    required for multi-account and multi-region operations

service / team:
    required for routing and ownership

state / previous_state:
    required to distinguish firing and recovery

resource_type / resource:
    required for targeted runbook links and diagnostics

dashboard / runbook:
    reduces incident response time
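
For the AWS path, the Lambda router might build this payload roughly as follows. Treat it as a sketch: it assumes the alarm follows the naming convention in these notes, reads the JSON keys that CloudWatch is understood to put in the SNS message (AlarmName, NewStateValue, OldStateValue, AWSAccountId, AlarmArn, Trigger), and takes service, team, resource, and link fields from a hypothetical extra mapping because the alarm message does not carry them. The region code is taken from the ARN, since the Region key in the message holds a display name.

```python
import json

SEVERITIES = ("P0", "P1", "P2", "P3")


def normalize_cloudwatch(sns_message, extra=None):
    """Map a CloudWatch alarm SNS message (JSON string) to the normalized payload."""
    alarm = json.loads(sns_message)
    name = alarm["AlarmName"]
    parts = name.split("-")
    # AlarmArn looks like arn:aws:cloudwatch:<region>:<account>:alarm:<name>
    arn = alarm.get("AlarmArn", "").split(":")
    trigger = alarm.get("Trigger") or {}
    payload = {
        "source": "cloudwatch",
        "severity": parts[0] if parts[0] in SEVERITIES else None,
        "env": parts[1] if len(parts) > 1 else None,
        "alert_name": name,
        "state": alarm.get("NewStateValue"),
        "previous_state": alarm.get("OldStateValue"),
        "account": alarm.get("AWSAccountId"),
        "region": arn[3] if len(arn) > 3 else None,
        "namespace": trigger.get("Namespace"),
        "signal": parts[-1],
        "summary": alarm.get("AlarmDescription"),
        "reason": alarm.get("NewStateReason"),
        "started_at": alarm.get("StateChangeTime"),
    }
    # service / team / resource / dashboard_url / runbook_url come from
    # a lookup the caller supplies (hypothetical).
    payload.update(extra or {})
    return payload


def handler(event, context):
    """Lambda entry point for SNS-triggered invocations."""
    return [normalize_cloudwatch(r["Sns"]["Message"]) for r in event["Records"]]
```

A real router would go on to pick a channel and post the payload; that part is left out here because it depends on the chat or paging tool in use.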

7. Dashboard And Runbook

Every P0/P1 alert should have:

dashboard:
    service overview
    metric that fired
    dependency metrics
    logs link if available

runbook:
    meaning
    common causes
    first 5 commands / queries
    rollback or mitigation
    escalation owner

Runbook template:

alert:
    P1-prod-alb-targetgroup-api-Target5xxRate

meaning:
    target returned too many 5xx responses

first checks:
    recent deployment
    target health
    app logs
    dependency errors

mitigation:
    rollback release
    scale service
    disable bad target
    enable maintenance page if needed

owner:
    platform on-call

8. State Handling

CloudWatch states:

ALARM:
    condition is breaching

OK:
    recovered

INSUFFICIENT_DATA:
    metric is missing or alarm cannot evaluate

Prometheus / vmalert states:

pending:
    expression is true but the `for` duration has not elapsed yet

firing:
    expression true for required duration

resolved:
    expression no longer true; Alertmanager reports the alert as resolved

Production rule:

send firing notification:
    P0/P1/P2 based on route

send recovery notification:
    P0/P1 always
    P2 optional

handle missing data:
    metric-pipeline alerts should fire when data disappears
    business metric alerts may treat missing data as not breaching
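
One way to apply the production rule across both state vocabularies is to map them onto a common firing / resolved pair first. The sketch below is illustrative: the function names and the two flags are assumptions, and P3 returns False because the routing rule sends it to a ticket or report instead of a notification. On the CloudWatch side, missing-data behavior is also configurable per alarm through its missing-data treatment setting, so the flag here only matters for what the router does with INSUFFICIENT_DATA.

```python
def common_state(state, missing_data_is_breaching=False):
    """Map CloudWatch and Prometheus/vmalert states onto firing/resolved/unknown."""
    if state in ("ALARM", "firing"):
        return "firing"
    if state in ("OK", "resolved"):
        return "resolved"
    if state == "INSUFFICIENT_DATA" and missing_data_is_breaching:
        # Metric-pipeline alerts: disappearing data is itself the problem.
        return "firing"
    # pending, or missing data on alerts that treat it as not breaching.
    return "unknown"


def should_notify(severity, state, notify_p2_recovery=False,
                  missing_data_is_breaching=False):
    """Decide whether a state change should produce a notification."""
    normalized = common_state(state, missing_data_is_breaching)
    if normalized == "firing":
        return severity in ("P0", "P1", "P2")
    if normalized == "resolved":
        if severity in ("P0", "P1"):
            return True
        return severity == "P2" and notify_p2_recovery
    return False
```

Keeping the missing-data choice as an explicit parameter forces each alert class to state its behavior instead of inheriting a default by accident.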

9. Common Mistakes

mistake:
    alert has no owner label
result:
    everyone sees it, nobody owns it

mistake:
    threshold is copied from another service
result:
    noisy or missed incident

mistake:
    error alert uses only an absolute count
result:
    high-traffic service pages constantly, low-traffic service misses real issues

mistake:
    no resolved notification
result:
    people do not know whether the incident has recovered

mistake:
    no runbook link
result:
    incident response starts from searching docs

mistake:
    dashboard link is missing account/region context
result:
    on-call opens the wrong environment
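
The absolute-count mistake is easy to see with numbers. The sketch below compares a count rule with a ratio rule; the thresholds (50 errors, 5%, a floor of 20 requests) are illustrative values chosen here, not recommendations, and the request floor is an added assumption to keep tiny samples from firing.

```python
def count_rule_breaches(errors, max_errors=50):
    """Absolute rule: fire when the error count exceeds a fixed number."""
    return errors > max_errors


def ratio_rule_breaches(errors, requests, max_ratio=0.05, min_requests=20):
    """Ratio rule: fire when the error share exceeds max_ratio.

    Windows with fewer than min_requests requests are skipped so that
    one failed request out of two does not page anyone.
    """
    if requests < min_requests:
        return False
    return errors / requests > max_ratio


# High-traffic service: 60 errors out of 1,000,000 requests.
high_traffic = (count_rule_breaches(60), ratio_rule_breaches(60, 1_000_000))

# Low-traffic service: 40 errors out of 40 requests.
low_traffic = (count_rule_breaches(40), ratio_rule_breaches(40, 40))
```

With these inputs the count rule fires for the healthy high-traffic service and stays silent for the fully failing low-traffic one, while the ratio rule does the opposite.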

10. Production Checklist

naming:
    alert names follow standard format
    severity is first
    env/service/resource_type/resource/signal are present

routing:
    P0/P1 route to on-call
    P2/P3 route to team channel or ticket
    prod and non-prod routes are separated

payload:
    account and region included
    state and previous_state included
    dashboard_url and runbook_url included

quality:
    every P0/P1 is actionable
    noisy alerts reviewed monthly
    thresholds tied to SLO or operational reason
    missing-data behavior is explicit