Links#
https://prometheus.io/docs/practices/alerting/
https://prometheus.io/docs/prometheus/latest/configuration/alerting_rules/
https://prometheus.io/docs/alerting/latest/alertmanager/
https://docs.victoriametrics.com/vmalert/
https://grafana.com/docs/grafana/latest/alerting/
https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/AlarmThatSendsEmail.html
1. Important Points#
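Alerting is not about pushing every anomaly into a chat channel. It is about sending the things a human needs to act on, in a stable, routable, and debuggable format.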
good alert:
  actionable
  has owner
  has severity
  has resource identity
  has dashboard / runbook
  has enough labels for routing
  fires only when human action is useful
bad alert:
  no owner
  no threshold reason
  no recovery notification
  no runbook
  noisy low-value warning
  only says "something is wrong"
The unified standard should cover every alerting system:
CloudWatch Alarm
Prometheus alerting rules
vmalert rules
Grafana Alerting
synthetic checks
log-based alerts
2. Severity#
| Severity | Meaning | Notify |
|---|---|---|
| P0 | user-facing outage / data loss / security emergency | page immediately |
| P1 | production degraded or important capability broken | on-call / incident channel |
| P2 | risk or partial issue, should be handled soon | team channel / work queue |
| P3 | low priority, trend, cleanup, capacity planning | ticket / report |
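The table can be encoded as a small lookup so that every alert source resolves a severity label to the same notification targets. This is a minimal sketch, not a definitive implementation: the `Severity` enum, the `NOTIFY` mapping, and the target strings are illustrative names chosen here, and treating the slash-separated entries in the Notify column as a list of candidate targets is an assumption.

```python
from enum import IntEnum


class Severity(IntEnum):
    """Severity levels from the table; a lower value is more urgent."""
    P0 = 0
    P1 = 1
    P2 = 2
    P3 = 3


# Notify targets per severity, transcribed from the table.
# The strings are placeholders, not real channel or integration names.
NOTIFY = {
    Severity.P0: ("page",),
    Severity.P1: ("on-call", "incident-channel"),
    Severity.P2: ("team-channel", "work-queue"),
    Severity.P3: ("ticket", "report"),
}


def parse_severity(value: str) -> Severity:
    """Parse a label such as 'P1' (case-insensitive); reject anything else."""
    try:
        return Severity[value.strip().upper()]
    except KeyError:
        raise ValueError(f"unknown severity: {value!r}") from None


def notify_targets(value: str) -> tuple:
    """Return the notification targets for a severity label."""
    return NOTIFY[parse_severity(value)]
```

Rejecting unknown labels up front keeps a mistyped severity from silently falling through to a low-priority route.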
P0:
  service unavailable
  no healthy target
  critical data path down
  high-severity security incident
P1:
  high 5xx rate
  task below desired count
  database replica unavailable / failover issue
  queue age violates SLA
P2:
  high CPU/memory for sustained time
  unhealthy target exists but capacity remains
  certificate expiring soon
  scaling or quota risk
P3:
  cost anomaly follow-up
  noisy dependency
  non-production issue
3. Alert Naming#
Recommended format:
<severity>-<env>-<service>-<resource_type>-<resource>-<signal>
Examples:
P0-prod-alb-targetgroup-api-HealthyHostCount
P1-prod-alb-targetgroup-api-Target5xxRate
P1-prod-ecs-service-api-TaskBelowDesired
P2-prod-ecs-service-api-CPUUtilizationHigh
P1-prod-aurora-cluster-main-CPUUtilization
P2-uat-sqs-queue-worker-OldestMessageAge
Why severity first:
console sorting:
  P0/P1 are easier to scan
routing:
  prefix matching is simple
incident review:
  export / search / dashboard grouping is easier
Rules:
use:
  stable env name: prod / uat / dev
  stable service name: api / worker / payment
  stable resource type: alb / targetgroup / ecs-service / queue / cluster
  meaningful signal name: Target5xxRate / HealthyHostCount / OldestMessageAge
avoid:
  random resource id only
  human name only
  severity hidden at the end
  long sentence as alarm name
4. Standard Labels#
Every alert should carry these labels or fields.
severity:
  P0 / P1 / P2 / P3
env:
  prod / uat / dev
service:
  api / worker / payment / platform
team:
  owning team or on-call route
resource_type:
  alb / targetgroup / ecs-service / sqs-queue / rds-cluster
resource:
  stable resource name or id
region:
  cloud region when applicable
account:
  cloud account id or tenant/project id
Prometheus/vmalert labels:
labels:
  severity: P1
  env: prod
  service: api
  team: platform
  resource_type: targetgroup
  resource: prod-api
For CloudWatch, a Lambda function should normalize the alarm payload into the same fields.
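A minimal sketch of that Lambda normalization step, assuming the alarm reaches the function through an SNS topic and that alarm names follow the naming standard from section 3. Only `severity`, `env`, and `signal` are read from the name (first, second, and last segment), because the middle segments may themselves contain hyphens. The alarm message carries no owner information, so `service`, `team`, `resource_type`, and `resource` are passed in through `extra_labels`, a hypothetical argument that could be filled from alarm tags or a static table. The function names are illustrative.

```python
import json
from typing import Optional


def normalize_cloudwatch_alarm(message: dict, extra_labels: Optional[dict] = None) -> dict:
    """Map a CloudWatch alarm SNS message onto the standard label set."""
    name = message["AlarmName"]
    segments = name.split("-")
    if len(segments) < 3:
        raise ValueError(f"alarm name does not follow the naming standard: {name!r}")
    # ARN layout: arn:aws:cloudwatch:<region>:<account>:alarm:<name>
    arn = message.get("AlarmArn", "").split(":")
    trigger = message.get("Trigger", {})
    labels = {
        "source": "cloudwatch",
        "alert_name": name,
        "severity": segments[0],
        "env": segments[1],
        "signal": segments[-1],
        "state": message["NewStateValue"],
        "previous_state": message.get("OldStateValue"),
        "account": message.get("AWSAccountId"),
        "region": arn[3] if len(arn) > 3 else None,
        "namespace": trigger.get("Namespace"),
        "reason": message.get("NewStateReason"),
        "started_at": message.get("StateChangeTime"),
    }
    # Ownership and resource labels are not in the alarm message (assumption:
    # the caller supplies them from tags or a lookup table).
    labels.update(extra_labels or {})
    return labels


def lambda_handler(event, context):
    """SNS-triggered entry point: one normalized dict per SNS record."""
    return [
        normalize_cloudwatch_alarm(json.loads(record["Sns"]["Message"]))
        for record in event["Records"]
    ]
```

Taking the region from the alarm ARN rather than the message's `Region` field is a deliberate choice here: the message field holds a display name, while the ARN holds the region code used in the payload examples.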
5. Notification Routing#
General routing path:
alert source
  -> routing layer
  -> notification channel
  -> incident / ticket / automation
AWS path:
CloudWatch Alarm
  -> SNS Topic
  -> Lambda notification router
  -> Slack / Teams / PagerDuty / Opsgenie / webhook
Prometheus / vmalert path:
Prometheus / vmalert
  -> Alertmanager
  -> receiver
  -> Slack / Teams / PagerDuty / webhook
Grafana path:
Grafana Alerting
  -> notification policy
  -> contact point
  -> Slack / Teams / PagerDuty / webhook
Routing rule:
P0/P1 prod:
  on-call page + incident channel
P2 prod:
  team channel + ticket
P3:
  ticket / report only
dev/uat:
  team channel only, unless explicitly critical
6. Standard Payload#
Normalized notification payload:
source: cloudwatch / prometheus / vmalert / grafana
severity: P1
env: prod
service: api
team: platform
alert_name: P1-prod-alb-targetgroup-api-Target5xxRate
state: firing / resolved / ALARM / OK / INSUFFICIENT_DATA
previous_state: OK / ALARM / INSUFFICIENT_DATA
account: 111122223333
region: ap-east-1
namespace: AWS/ApplicationELB
resource_type: targetgroup
resource: targetgroup/prod-api/def456
signal: Target5xxRate
summary: short human-readable message
reason: alarm reason or expression value
started_at: 2026-06-04T10:20:30Z
dashboard_url: https://...
runbook_url: https://...
Why these fields matter:
account / region:
  required for multi-account and multi-region operations
service / team:
  required for routing and ownership
state / previous_state:
  required to distinguish firing and recovery
resource_type / resource:
  required for targeted runbook links and diagnostics
dashboard / runbook:
  reduces incident response time
7. Dashboard And Runbook#
Every P0/P1 alert should have:
dashboard:
  service overview
  metric that fired
  dependency metrics
  logs link if available
runbook:
  meaning
  common causes
  first 5 commands / queries
  rollback or mitigation
  escalation owner
Runbook template:
alert:
  P1-prod-alb-targetgroup-api-Target5xxRate
meaning:
  target returned too many 5xx responses
first checks:
  recent deployment
  target health
  app logs
  dependency errors
mitigation:
  rollback release
  scale service
  disable bad target
  enable maintenance page if needed
owner:
  platform on-call
8. State Handling#
CloudWatch states:
ALARM:
  condition is breaching
OK:
  recovered
INSUFFICIENT_DATA:
  metric is missing or alarm cannot evaluate
Prometheus / vmalert states:
pending:
  expression is true, but the "for" duration has not elapsed yet
firing:
  expression has been true for the required duration
resolved:
  expression is no longer true
Production rule:
send firing notification:
  P0/P1/P2 based on route
send recovery notification:
  P0/P1 always
  P2 optional
handle missing data:
  metric-pipeline alerts should fire when data disappears
  business metric alerts may treat missing data as not breaching
9. Common Mistakes#
mistake:
  alert has no owner label
result:
  everyone sees it, nobody owns it
mistake:
  threshold is copied from another service
result:
  noisy alerts or missed incidents
mistake:
  error alert uses only an absolute count
result:
  a high-traffic service pages constantly, a low-traffic service misses real issues
mistake:
  no resolved notification
result:
  people do not know whether the incident has recovered
mistake:
  no runbook link
result:
  incident response starts with searching for docs
mistake:
  dashboard link is missing account/region context
result:
  on-call opens the wrong environment
10. Production Checklist#
naming:
  alert names follow standard format
  severity is first
  env/service/resource_type/resource/signal are present
routing:
  P0/P1 route to on-call
  P2/P3 route to team channel or ticket
  prod and non-prod routes are separated
payload:
  account and region included
  state and previous_state included
  dashboard_url and runbook_url included
quality:
  every P0/P1 is actionable
  noisy alerts reviewed monthly
  thresholds tied to SLO or operational reason
  missing-data behavior is explicit
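The naming and payload items of this checklist lend themselves to an automated check, for example in CI or inside the notification router. The sketch below is one possible shape, not a definitive implementation: `check_alert`, `NAME_PATTERN`, and `REQUIRED_FIELDS` are names invented for illustration, and the pattern only verifies that severity comes first, env second, and that a signal segment exists. The routing and quality items still need human review.

```python
import re

# Severity first, then env, then at least one middle segment and a signal.
# The middle segments are not validated because they may contain hyphens.
NAME_PATTERN = re.compile(r"^(P[0-3])-(prod|uat|dev)-.+-[^-]+$")

# Payload fields named in the checklist.
REQUIRED_FIELDS = (
    "account",
    "region",
    "state",
    "previous_state",
    "dashboard_url",
    "runbook_url",
)


def check_alert(payload: dict) -> list:
    """Return a list of checklist violations; an empty list means the payload passes."""
    problems = []
    name = payload.get("alert_name", "")
    match = NAME_PATTERN.match(name)
    if not match:
        problems.append(f"alert_name does not follow the standard format: {name!r}")
    elif match.group(1) != payload.get("severity"):
        problems.append("severity label does not match the alert_name prefix")
    for field in REQUIRED_FIELDS:
        if not payload.get(field):
            problems.append(f"missing field: {field}")
    return problems
```

Returning every violation at once, rather than stopping at the first, lets a reviewer fix an alert definition in a single pass.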