Grafana Alerting

Links#

https://grafana.com/docs/grafana/latest/alerting/
https://grafana.com/docs/grafana/latest/alerting/fundamentals/alert-rules/
https://grafana.com/docs/grafana/latest/alerting/fundamentals/notification-policies/
https://grafana.com/docs/grafana/latest/alerting/fundamentals/contact-points/
https://grafana.com/docs/grafana/latest/alerting/set-up/provision-alerting-resources/

1. Important Points#

Grafana Alerting fits teams that already manage dashboards in Grafana and need alerts across multiple data sources. It can evaluate rules against Prometheus, Loki, CloudWatch, SQL, and other data sources.

Use Grafana Alerting for:
    dashboard-adjacent alerting
    multi-datasource alerts
    contact points
    notification policies
    UI-managed or provisioned alert rules

Grafana Alerting is not a good fit for:
    replacing Prometheus/vmalert for very large PromQL rule sets
    uncontrolled click-ops in production
    alerts without provisioning / review process

Recommended split:

Prometheus / vmalert:
    infrastructure and service metric rules at scale

Grafana Alerting:
    dashboard-owned alerts
    multi-source alerts
    small team-managed alerts
    visual alert review

2. Core Concepts#

Concept	Meaning
Alert rule	condition evaluated by Grafana on a schedule
Folder	rule organization and permissions
Evaluation group	rules in a folder that share one evaluation interval
Contact point	notification destination: Slack, Teams, PagerDuty, webhook, email
Notification policy	routing tree that matches labels to contact points
Label	routing/grouping metadata
Annotation	human-readable context
Silence	temporary, one-off mute matched by labels

3. Label Standard#

Use the same labels as the global alerting standard:

severity
env
service
team
resource_type
resource

Example labels:

severity: P1
env: prod
service: api
team: platform
resource_type: service
resource: api

Annotations:

summary: API 5xx rate is high
description: API 5xx rate is above 5% for 5 minutes.
dashboard_url: https://grafana.example.com/d/api
runbook_url: https://docs.example.com/runbooks/api-5xx

4. Notification Policies#

Recommended policy tree:

root policy:
    default contact point = platform-warning

routes:
    severity=P0|P1, env=prod
        -> pagerduty-prod
        -> continue matching so the team route below also applies

    team=platform
        -> platform-slack

    env=dev|uat
        -> nonprod-alerts
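
The tree above could be provisioned roughly as follows. This is a minimal sketch, assuming the contact points named on this page already exist and that your Grafana version accepts the `policies` file-provisioning format with `object_matchers`. The `group_by` labels are an illustrative choice, not part of the standard. Compare it with an export from your own instance before relying on it.

```yaml
# notification-policies.yaml (sketch; verify against an export from your Grafana version)
apiVersion: 1
policies:
  - orgId: 1
    # Root policy: anything not matched by a route below lands here.
    receiver: platform-warning
    # Assumption: grouping labels chosen for illustration only.
    group_by: ['alertname', 'env', 'service']
    routes:
      # P0/P1 in prod pages on-call, then keeps matching sibling routes.
      - receiver: pagerduty-prod
        object_matchers:
          - ['severity', '=~', 'P0|P1']
          - ['env', '=', 'prod']
        continue: true
      # Team channel for platform-owned alerts.
      - receiver: platform-slack
        object_matchers:
          - ['team', '=', 'platform']
      # Non-production environments.
      - receiver: nonprod-alerts
        object_matchers:
          - ['env', '=~', 'dev|uat']
```

Sibling routes are evaluated top to bottom and matching stops at the first hit unless `continue` is set. In this order, a `team=platform` alert from `dev` would go to `platform-slack`; move the non-prod route up if it should win. Provisioning the policy tree from a file generally replaces the whole tree for that organization, so keep every route in this one file.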

Principles:

route by labels, not alert names
send resolved notifications for P0/P1
avoid one contact point per alert
use mute timings for recurring maintenance windows and silences for one-off maintenance
use folders and provisioning for ownership
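
For the mute-timing principle, a recurring maintenance window might be provisioned like this. It is a sketch only: the name `weekly-maintenance` and the schedule are hypothetical, and the exact fields should be checked against your Grafana version.

```yaml
# mute-timings.yaml (sketch; name and schedule are hypothetical)
apiVersion: 1
muteTimes:
  - orgId: 1
    name: weekly-maintenance
    time_intervals:
      # Assumption: times are interpreted as UTC unless a location is configured.
      - weekdays: ['sunday']
        times:
          - start_time: '02:00'
            end_time: '04:00'
```

A route in the notification policy file then refers to the timing by name through its `mute_time_intervals` list. Alerts are still evaluated during the window; only notifications for that route are suppressed.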

5. Provisioning Pattern#

Keep production alerting in code.

grafana/
├── provisioning/
│   └── alerting/
│       ├── contact-points.yaml
│       ├── notification-policies.yaml
│       └── rules-api.yaml

Example rule provisioning shape:

apiVersion: 1
groups:
  - orgId: 1
    name: api.rules
    folder: Production
    interval: 1m
    rules:
      - uid: p1-prod-api-high-5xx
        title: P1-prod-api-service-api-High5xxRate
        condition: C
        labels:
          severity: P1
          env: prod
          service: api
          team: platform
          resource_type: service
          resource: api
        annotations:
          summary: API 5xx rate is high
          runbook_url: https://docs.example.com/runbooks/api-5xx
          dashboard_url: https://grafana.example.com/d/api
        noDataState: NoData
        execErrState: Error
        for: 5m
        data:
          - refId: A
            datasourceUid: prometheus
            relativeTimeRange:
              from: 300
              to: 0
            model:
              expr: |
                100 *
                sum(rate(http_requests_total{env="prod", service="api", status=~"5.."}[5m]))
                /
                clamp_min(sum(rate(http_requests_total{env="prod", service="api"}[5m])), 1)
              instant: true  # threshold expression C needs one value per series, not a range
              refId: A
          - refId: C
            datasourceUid: __expr__
            model:
              type: threshold
              expression: A
              conditions:
                - evaluator:
                    type: gt
                    params: [5]

Note:

The Grafana provisioning schema can vary by version.
Treat the rule JSON/YAML exported from your own Grafana version as the source of truth.
Keep labels and annotations aligned with the standard.

6. No Data And Error State#

Choose noDataState and execErrState deliberately for each rule.

Situation	noDataState	execErrState
service metric absent means broken	Alerting / NoData	Error
sparse business event metric	OK / NoData	Error
datasource query failure	n/a (query errors are governed by execErrState)	Error
experimental dashboard alert	NoData	Error

Guidance:

NoData:
    should not be ignored for uptime/heartbeat alerts

Error:
    usually means datasource or query problem
    should route to monitoring owner
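
As an illustration of the table above, the fragment below shows only the state fields of two rules. The `title` values are hypothetical and the `uid`, `condition`, and `data` fields are left out, so this is not a complete rule definition.

```yaml
# Fragment of the rules list in a provisioned group (sketch; queries omitted).
rules:
  # Heartbeat-style rule: a missing series is treated as a firing alert.
  - title: P1-prod-api-service-api-HeartbeatMissing    # hypothetical
    noDataState: Alerting
    execErrState: Error
  # Sparse business event: gaps are expected, so no data stays normal.
  - title: P3-prod-orders-service-orders-RefundSpike   # hypothetical
    noDataState: OK
    execErrState: Error
```

Keeping `execErrState: Error` in both cases separates "the query is broken" from "the service is broken", which matches the guidance to route errors to the monitoring owner.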

7. Contact Points#

Contact points should be reusable:

pagerduty-prod:
    P0/P1 prod

platform-slack:
    platform team alerts

nonprod-alerts:
    dev/uat alerts

ticket-webhook:
    P3 or non-urgent alerts

Webhook payload should include:

severity
env
service
team
alertname
state
dashboard_url
runbook_url
values
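
A reusable set of contact points might be provisioned along these lines. This is a sketch under assumptions: the environment variable names and the webhook URL are hypothetical, and the `settings` keys shown (`integrationKey`, `url`) should be confirmed against an export from your Grafana version.

```yaml
# contact-points.yaml (sketch; verify field names against an export)
apiVersion: 1
contactPoints:
  - orgId: 1
    name: pagerduty-prod
    receivers:
      - uid: pagerduty-prod
        type: pagerduty
        # Keep resolved notifications on for P0/P1.
        disableResolveMessage: false
        settings:
          integrationKey: $PAGERDUTY_INTEGRATION_KEY      # hypothetical env var
  - orgId: 1
    name: platform-slack
    receivers:
      - uid: platform-slack
        type: slack
        settings:
          url: $PLATFORM_SLACK_WEBHOOK_URL                # hypothetical env var
  - orgId: 1
    name: ticket-webhook
    receivers:
      - uid: ticket-webhook
        type: webhook
        settings:
          url: https://tickets.example.com/hooks/grafana  # hypothetical endpoint
```

Referencing secrets through environment variables keeps keys out of the repository, in line with the "secret references" item in the operations checklist. Grafana's default webhook body should already carry each alert's labels, annotations, and status, so most of the fields listed above arrive without a custom template as long as the rules set them.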

8. Operations#

Checklist:

provisioning:
    alert rules in repo
    contact points in repo or managed with secret references
    notification policies in repo
    review process before prod changes

permissions:
    only owners can edit production alert folders
    UI edits reconciled back to code or disabled

testing:
    test contact points
    test route labels
    test no data behavior
    send resolved enabled for P0/P1

quality:
    dashboard link points to same datasource/environment
    runbook link exists
    labels follow global standard

9. Common Mistakes#

mistake:
    Grafana rule created manually and not exported
result:
    rule disappears or drifts during migration

mistake:
    route by alert title instead of labels
result:
    renaming breaks notification routing

mistake:
    noDataState ignored for uptime alert
result:
    missing metrics hide outages

mistake:
    dashboard alert has no runbook
result:
    on-call sees graph but does not know mitigation