Links#
https://grafana.com/docs/grafana/latest/alerting/
https://grafana.com/docs/grafana/latest/alerting/fundamentals/alert-rules/
https://grafana.com/docs/grafana/latest/alerting/fundamentals/notification-policies/
https://grafana.com/docs/grafana/latest/alerting/fundamentals/contact-points/
https://grafana.com/docs/grafana/latest/alerting/set-up/provision-alerting-resources/
1. Important Points#
Grafana Alerting is useful when teams already manage dashboards in Grafana and need alerts from multiple data sources. It can evaluate rules against Prometheus, Loki, CloudWatch, SQL, and other data sources.
Use Grafana Alerting for:
  dashboard-adjacent alerting
  multi-datasource alerts
  contact points
  notification policies
  UI-managed or provisioned alert rules
Grafana Alerting is not a good fit for:
  replacing Prometheus/vmalert for very large PromQL rule sets
  uncontrolled click-ops in production
  alerts that skip provisioning and review
Recommended split:
Prometheus / vmalert:
  infrastructure and service metric rules at scale
Grafana Alerting:
  dashboard-owned alerts
  multi-source alerts
  small team-managed alerts
  visual alert review
2. Core Concepts#
| Concept | Meaning |
|---|---|
| Alert rule | condition evaluated by Grafana |
| Folder | rule organization and permissions |
| Evaluation group | rule evaluation interval grouping |
| Contact point | Slack, Teams, PagerDuty, webhook, email |
| Notification policy | routing tree |
| Label | routing/grouping metadata |
| Annotation | human-readable context |
| Silence | temporary mute |
3. Label Standard#
Use the same labels as the global alerting standard:
  severity
  env
  service
  team
  resource_type
  resource
Example labels:
  severity: P1
  env: prod
  service: api
  team: platform
  resource_type: service
  resource: api
Annotations:
  summary: API 5xx rate is high
  description: API 5xx rate is above 5% for 5 minutes.
  dashboard_url: https://grafana.example.com/d/api
  runbook_url: https://docs.example.com/runbooks/api-5xx
4. Notification Policies#
Recommended policy tree:
root policy:
  default contact point = platform-warning
  routes:
    severity=P0|P1, env=prod
      -> pagerduty-prod
      -> continue to team channel
    team=platform
      -> platform-slack
    env=dev|uat
      -> nonprod-alerts
Principles:
  route by labels, not alert names
  send resolved notifications for P0/P1
  avoid one contact point per alert
  use mute timings for planned maintenance
  use folders and provisioning for ownership
5. Provisioning Pattern#
Keep production alerting in code.
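As one illustration, the policy tree from section 4 could be kept in a file such as notification-policies.yaml. This is a minimal sketch that assumes the file-provisioning format of recent Grafana versions; the `policies`, `object_matchers`, and `continue` fields should be checked against an export from your own instance, and the contact point names are assumed to exist already.

```yaml
# notification-policies.yaml (sketch, not a verified export)
apiVersion: 1
policies:
  - orgId: 1
    receiver: platform-warning          # default contact point of the root policy
    routes:
      # P0/P1 in prod pages on-call, then keeps matching sibling routes
      - receiver: pagerduty-prod
        object_matchers:
          - ["severity", "=~", "P0|P1"]
          - ["env", "=", "prod"]
        continue: true
      # Platform team alerts go to the team channel
      - receiver: platform-slack
        object_matchers:
          - ["team", "=", "platform"]
      # Non-production environments share one low-urgency channel
      - receiver: nonprod-alerts
        object_matchers:
          - ["env", "=~", "dev|uat"]
```

Sibling routes are checked top to bottom and matching stops at the first hit unless `continue` is set, so in this ordering a dev alert labeled team=platform would land in platform-slack rather than nonprod-alerts. Reorder the routes if that is not the intent.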
grafana/
├── provisioning/
│   └── alerting/
│       ├── contact-points.yaml
│       ├── notification-policies.yaml
│       └── rules-api.yaml
Example rule provisioning shape:
apiVersion: 1
groups:
  - orgId: 1
    name: api.rules
    folder: Production
    interval: 1m
    rules:
      - uid: p1-prod-api-high-5xx
        title: P1-prod-api-service-api-High5xxRate
        condition: C   # refId of the expression that decides whether the rule fires
        labels:
          severity: P1
          env: prod
          service: api
          team: platform
          resource_type: service
          resource: api
        annotations:
          summary: API 5xx rate is high
          runbook_url: https://docs.example.com/runbooks/api-5xx
          dashboard_url: https://grafana.example.com/d/api
        noDataState: NoData
        execErrState: Error
        for: 5m
        data:
          # A: 5xx responses as a percentage of all requests over 5 minutes;
          # clamp_min keeps the denominator from being zero
          - refId: A
            datasourceUid: prometheus
            relativeTimeRange:
              from: 300
              to: 0
            model:
              expr: |
                100 *
                sum(rate(http_requests_total{env="prod", service="api", status=~"5.."}[5m]))
                /
                clamp_min(sum(rate(http_requests_total{env="prod", service="api"}[5m])), 1)
              # instant query: returns a single value per series, which the
              # threshold expression can evaluate without a reduce step
              instant: true
              refId: A
          # C: fire when A is above 5 (percent)
          - refId: C
            datasourceUid: __expr__
            model:
              type: threshold
              expression: A
              conditions:
                - evaluator:
                    type: gt
                    params: [5]
Note:
  Grafana provisioning schema can vary by version.
  Use exported rule JSON/YAML from your Grafana version as the source of truth.
  Keep labels and annotations aligned with the standard.
6. No Data And Error State#
Choose state deliberately.
| Situation | noDataState | execErrState |
|---|---|---|
| service metric absent means broken | Alerting / NoData | Error |
| sparse business event metric | OK / NoData | Error |
| datasource query failure | not applicable (a failed query is an execution error) | Error |
| experimental dashboard alert | NoData | Error |
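In a provisioned rule, each row of the table comes down to two fields. The fragment below is a sketch with two hypothetical rules (the uids are invented for illustration) and omits the other required fields such as title, condition, and data.

```yaml
# Fragment of a rules list; only the state-handling fields are shown.
# Heartbeat/uptime rule: a missing series is itself the failure.
- uid: p1-prod-api-heartbeat          # hypothetical rule
  noDataState: Alerting
  execErrState: Error
# Sparse business-event rule: gaps between events are expected.
- uid: p3-prod-refund-spike           # hypothetical rule
  noDataState: OK
  execErrState: Error
```

Keeping `execErrState: Error` in both cases means a broken query or data source shows up as its own state instead of being mistaken for a healthy or firing rule.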
Guidance:
NoData:
  should not be ignored for uptime/heartbeat alerts
Error:
  usually means datasource or query problem
  should route to monitoring owner
7. Contact Points#
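A minimal sketch of how some of the reusable contact points named in this section might look in contact-points.yaml. It assumes the file-provisioning format of recent Grafana versions and environment-variable interpolation for secrets. The uids, variable names, and ticket URL are hypothetical, and the `settings` keys should be checked against an export from your own instance.

```yaml
# contact-points.yaml (sketch, not a verified export)
apiVersion: 1
contactPoints:
  - orgId: 1
    name: pagerduty-prod
    receivers:
      - uid: pagerduty-prod                        # hypothetical uid
        type: pagerduty
        settings:
          integrationKey: $PAGERDUTY_PROD_KEY      # hypothetical env var, keeps the secret out of the repo
        disableResolveMessage: false               # send resolved notifications for P0/P1
  - orgId: 1
    name: platform-slack
    receivers:
      - uid: platform-slack                        # hypothetical uid
        type: slack
        settings:
          url: $PLATFORM_SLACK_WEBHOOK_URL         # hypothetical env var
  - orgId: 1
    name: ticket-webhook
    receivers:
      - uid: ticket-webhook                        # hypothetical uid
        type: webhook
        settings:
          url: https://tickets.example.com/hooks/grafana   # hypothetical endpoint
          httpMethod: POST
```

Each contact point is defined once and referenced by name from notification policies, which is what keeps the number of contact points small as rules are added.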
Contact points should be reusable:
pagerduty-prod:
  P0/P1 prod
platform-slack:
  platform team alerts
nonprod-alerts:
  dev/uat alerts
ticket-webhook:
  P3 or non-urgent alerts
Webhook payload should include:
  severity
  env
  service
  team
  alertname
  state
  dashboard_url
  runbook_url
  values
8. Operations#
Checklist:
provisioning:
  alert rules in repo
  contact points in repo or managed with secret references
  notification policies in repo
  review process before prod changes
permissions:
  only owners can edit production alert folders
  UI edits reconciled back to code or disabled
testing:
  test contact points
  test route labels
  test no data behavior
  send resolved enabled for P0/P1
quality:
  dashboard link points to same datasource/environment
  runbook link exists
  labels follow global standard
9. Common Mistakes#
mistake:
  Grafana rule created manually and not exported
result:
  rule disappears or drifts during migration
mistake:
  route by alert title instead of labels
result:
  renaming breaks notification routing
mistake:
  noDataState ignored for uptime alert
result:
  missing metrics hide outages
mistake:
  dashboard alert has no runbook
result:
  on-call sees graph but does not know mitigation