The Complete Guide to Prometheus / 16 - Grafana Integration

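16 - Grafana Integration

16.1 Overview

Grafana is one of the most widely used open-source data visualization platforms and has built-in support for Prometheus. This chapter covers configuring the Prometheus data source, creating dashboards, and using Grafana alerting.

Grafana Core Capabilities

Feature	Description
Data sources	Prometheus, Elasticsearch, MySQL, InfluxDB, and others
Dashboards	A rich set of visualization panels
Alerting	Unified alerting in Grafana 8+
Variables	Dynamic, parameterized dashboards
Annotations	Mark events on graphs
Dashboard sharing	Export/import as JSON

16.2 Installing Grafana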

Docker

docker run -d \
  --name=grafana \
  -p 3000:3000 \
  -v grafana_data:/var/lib/grafana \
  -e "GF_SECURITY_ADMIN_USER=admin" \
  -e "GF_SECURITY_ADMIN_PASSWORD=admin" \
  grafana/grafana:10.4.0

Docker Compose

services:
  grafana:
    image: grafana/grafana:10.4.0
    container_name: grafana
    restart: unless-stopped
    ports:
      - "3000:3000"
    environment:
      - GF_SECURITY_ADMIN_USER=admin
      - GF_SECURITY_ADMIN_PASSWORD=admin123
      - GF_USERS_ALLOW_SIGN_UP=false
      - GF_SERVER_ROOT_URL=https://grafana.example.com
    volumes:
      - grafana_data:/var/lib/grafana
      - ./grafana/provisioning:/etc/grafana/provisioning:ro
# Named volumes used by a service must be declared at the top level
volumes:
  grafana_data:

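Default Access

URL: http://localhost:3000
Username: admin
Password: admin (Grafana prompts for a new password on first login)

16.3 Configuring the Prometheus Data Source

Via the Web UI

Log in to Grafana
Left menu → Connections → Data sources → Add new data source (Configuration → Data Sources → Add data source in Grafana 9 and earlier)
Select Prometheus
Enter the URL: http://prometheus:9090
Click Save & Test

Via Provisioning (recommended)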

# grafana/provisioning/datasources/prometheus.yml
apiVersion: 1

datasources:
  - name: Prometheus
    uid: prometheus  # fixed UID so alert rules and dashboards can reference this data source
    type: prometheus
    access: proxy
    url: http://prometheus:9090
    isDefault: true
    editable: true
    jsonData:
      httpMethod: POST
      timeInterval: '15s'
      queryTimeout: '60s'

  - name: Prometheus-Thanos
    type: prometheus
    access: proxy
    url: http://thanos-query:9090
    editable: true
    jsonData:
      httpMethod: POST

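Data Source Configuration Parameters

Parameter	Description
`url`	Prometheus address
`access`	`proxy` (requests go through the Grafana backend) or `direct` (the browser connects directly; deprecated, and no longer offered for Prometheus in recent Grafana releases)
`httpMethod`	`GET` or `POST` (POST is recommended to avoid overly long URLs)
`timeInterval`	Should match the Prometheus scrape_interval; used as the lower bound when computing `$__interval`
`queryTimeout`	Query timeout
`maxLines`	Maximum number of lines returned; applies to log data sources such as Loki, not to Prometheus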
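
To make the `timeInterval` parameter more concrete, the sketch below approximates how `$__interval` and `$__rate_interval` relate to the scrape interval. It is a simplified model: Grafana also rounds `$__interval` to a convenient step, which is omitted here, and the function names are hypothetical. The `$__rate_interval` rule used is the one described in the Grafana documentation, max(`$__interval` + scrape interval, 4 × scrape interval).

```python
def approx_interval(range_seconds: float, max_data_points: int, min_interval: float) -> float:
    """Approximate $__interval: time range / data points, never below min_interval."""
    if max_data_points <= 0:
        raise ValueError("max_data_points must be positive")
    return max(range_seconds / max_data_points, min_interval)


def rate_interval(interval: float, scrape_interval: float) -> float:
    """$__rate_interval: at least four scrape intervals, so rate() always sees enough samples."""
    return max(interval + scrape_interval, 4 * scrape_interval)


if __name__ == "__main__":
    scrape = 15.0  # assumed to match timeInterval: '15s'
    for label, range_seconds in [("1h", 3600), ("24h", 86400)]:
        interval = approx_interval(range_seconds, max_data_points=1000, min_interval=scrape)
        print(label, interval, rate_interval(interval, scrape))
```

With a short time range the scrape interval dominates, which is why setting `timeInterval` to the real scrape interval keeps `rate()` windows from becoming too narrow.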

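16.4 Creating a Dashboard

Creating a Panel Manually

Left menu → Dashboards → New → New dashboard
Add a panel → Add visualization
Select a data source
Enter a PromQL query
Configure the visualization options
Save

Automatic Import via Provisioning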

# grafana/provisioning/dashboards/dashboards.yml
apiVersion: 1

providers:
  - name: 'Custom'
    orgId: 1
    folder: 'Custom'
    type: file
    disableDeletion: false
    editable: true
    allowUiUpdates: true
    options:
      path: /etc/grafana/provisioning/dashboards/json
      foldersFromFilesStructure: false  # setting this to true requires leaving folder/folderUid unset

// grafana/provisioning/dashboards/json/node-exporter.json
// File-provisioned dashboards contain the dashboard model itself, without the "dashboard" wrapper used by the HTTP API.
// Remove these comment lines before saving: JSON does not allow comments.
{
  "id": null,
  "uid": "node-exporter",
  "title": "Node Exporter",
  "tags": ["node", "linux"],
  "timezone": "browser",
  "refresh": "30s",
  "panels": [
    {
      "id": 1,
      "title": "CPU Usage",
      "type": "timeseries",
      "datasource": "Prometheus",
      "targets": [
        {
          "expr": "100 - (avg by(instance) (rate(node_cpu_seconds_total{mode=\"idle\"}[5m])) * 100)",
          "legendFormat": "{{ instance }}"
        }
      ],
      "fieldConfig": {
        "defaults": {
          "unit": "percent",
          "min": 0,
          "max": 100,
          "thresholds": {
            "mode": "absolute",
            "steps": [
              { "value": null, "color": "green" },
              { "value": 80, "color": "yellow" },
              { "value": 90, "color": "red" }
            ]
          }
        }
      }
    }
  ],
  "time": {
    "from": "now-1h",
    "to": "now"
  }
}
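
Dashboard JSON is often generated rather than written by hand. The following is a minimal sketch that builds a dashboard model with the same fields in Python and serializes it. `timeseries_panel` and `build_dashboard` are hypothetical helper names; only the JSON field names come from the example above.

```python
import json


def timeseries_panel(panel_id, title, expr, unit="percent"):
    """Build one timeseries panel that queries the data source named "Prometheus"."""
    return {
        "id": panel_id,
        "title": title,
        "type": "timeseries",
        "datasource": "Prometheus",
        "targets": [{"expr": expr, "legendFormat": "{{ instance }}"}],
        "fieldConfig": {"defaults": {"unit": unit}},
    }


def build_dashboard(uid, title, panels):
    """Build the top-level dashboard model (no API-style wrapper)."""
    return {
        "id": None,  # serialized as null so Grafana assigns the numeric id
        "uid": uid,
        "title": title,
        "timezone": "browser",
        "refresh": "30s",
        "panels": panels,
        "time": {"from": "now-1h", "to": "now"},
    }


if __name__ == "__main__":
    cpu = timeseries_panel(
        1,
        "CPU Usage",
        '100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)',
    )
    print(json.dumps(build_dashboard("node-exporter", "Node Exporter", [cpu]), indent=2))
```

Redirecting the output into the provisioning directory (for example `json/node-exporter.json`) would let the file provider pick it up on its next scan.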

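16.5 Common Dashboards

Recommended Dashboards

ID	Name	Purpose
1860	Node Exporter Full	Comprehensive system metrics
7362	MySQL Overview	MySQL database monitoring
11835	Redis Dashboard	Redis cache monitoring
9628	Docker Monitoring	Docker container monitoring
6417	Kubernetes Cluster	Kubernetes cluster monitoring
15661	Prometheus 2.0 Stats	Monitoring Prometheus itself

The IDs refer to listings on grafana.com; confirm that each ID matches the dashboard name before importing.

Importing a Dashboard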

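# Import through the Grafana API. The import endpoint expects the full dashboard
# JSON, so download it from grafana.com first (requires jq).
curl -sL https://grafana.com/api/dashboards/1860/revisions/latest/download -o dashboard.json

jq -n --slurpfile d dashboard.json '{
    dashboard: $d[0],
    overwrite: true,
    inputs: [{name: "DS_PROMETHEUS", type: "datasource", pluginId: "prometheus", value: "Prometheus"}]
  }' | curl -X POST http://admin:admin@localhost:3000/api/dashboards/import \
  -H "Content-Type: application/json" \
  -d @-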
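
The request body for the import call can also be assembled programmatically. The sketch below is an illustration only: `build_import_payload` is a hypothetical helper, and it assumes the downloaded dashboard declares its data source placeholders in an `__inputs` array, as dashboards exported for sharing usually do.

```python
import json


def build_import_payload(dashboard: dict, datasource_name: str, overwrite: bool = True) -> dict:
    """Map every Prometheus data source placeholder in the dashboard to one data source."""
    inputs = [
        {
            "name": item["name"],
            "type": "datasource",
            "pluginId": "prometheus",
            "value": datasource_name,
        }
        for item in dashboard.get("__inputs", [])
        if item.get("type") == "datasource" and item.get("pluginId") == "prometheus"
    ]
    return {"dashboard": dashboard, "overwrite": overwrite, "inputs": inputs}


if __name__ == "__main__":
    # Stand-in for a downloaded dashboard file; real files are much larger.
    sample = {
        "title": "Demo",
        "__inputs": [{"name": "DS_PROMETHEUS", "type": "datasource", "pluginId": "prometheus"}],
        "panels": [],
    }
    print(json.dumps(build_import_payload(sample, "Prometheus"), indent=2))
```

Deriving `inputs` from the file avoids hard-coding `DS_PROMETHEUS`, which not every published dashboard uses as its placeholder name.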

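Importing via the Web UI

Left menu → Dashboards → New → Import
Enter the dashboard ID (for example, 1860)
Select the Prometheus data source
Click Import

16.6 Variables

Variables make a dashboard dynamic and interactive.

Creating a Variable

Dashboard settings → Variables → Add variable
Configure the variable

Variable Types

Type	Description	Example
Query	Values queried from Prometheus	`label_values(up, job)`
Custom	A user-defined list of values	`production,staging,dev`
Interval	A list of time intervals	`1m,5m,15m,1h`
Datasource	Data source selector	`prometheus`

Common Variable Queries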

# All job names
label_values(up, job)

# Instances of a specific job
label_values(up{job="$job"}, instance)

# All cluster names
label_values(up, cluster)

# All status codes
label_values(http_requests_total, status)

# All namespaces
label_values(kube_pod_info, namespace)

Referencing Variables

# Use a variable in a query
rate(http_requests_total{job="$job"}[5m])

# Multi-value variable (requires the regex matcher =~)
rate(http_requests_total{job=~"$job"}[5m])

# Use the $interval variable (an Interval-type variable named "interval")
rate(http_requests_total[$interval])
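
The reason multi-value variables need `=~` becomes clearer when looking at what the variable expands to. The sketch below is a simplified model of that expansion, not Grafana's actual implementation: a single value is inserted as-is, several values become a `(a|b)` alternation, and regex metacharacters are escaped. `interpolate` is a hypothetical function name, and backslashes inside values are not handled.

```python
import re

# Regex metacharacters to escape inside label values (simplified set).
_SPECIAL = re.compile(r"[.^$*+?()\[\]{}|]")


def interpolate(query: str, name: str, values: list) -> str:
    """Replace $name in a PromQL query with one value or a regex alternation."""
    # Two backslashes are emitted because PromQL string literals unescape one level.
    escaped = [_SPECIAL.sub(lambda m: "\\\\" + m.group(0), v) for v in values]
    replacement = escaped[0] if len(escaped) == 1 else "(" + "|".join(escaped) + ")"
    pattern = r"\$" + re.escape(name) + r"\b"
    return re.sub(pattern, lambda m: replacement, query)


if __name__ == "__main__":
    query = 'rate(http_requests_total{job=~"$job"}[5m])'
    print(interpolate(query, "job", ["api"]))
    print(interpolate(query, "job", ["api", "web"]))
```

With the equality matcher `=`, the expanded `(api|web)` would be compared literally and match nothing, so `=~` is the safe choice whenever a variable allows multiple selections.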

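16.7 Grafana Alerting

Grafana 8+ introduced Unified Alerting, which lets you manage alerts directly in Grafana.

Configuring Alert Rules

Left menu → Alerting → Alert rules
New alert rule
Configure the query and condition
Set the evaluation interval and notifications

Alert Rule Example

# Alert rules configured through provisioning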
# grafana/provisioning/alerting/rules.yml
apiVersion: 1

groups:
  - orgId: 1
    name: Infrastructure
    folder: Alerts
    interval: 1m
    rules:
      - uid: cpu-high
        title: High CPU Usage
        condition: C
        data:
          - refId: A
            datasourceUid: prometheus  # must match the uid of the Prometheus data source
            relativeTimeRange:         # query window in seconds, relative to now
              from: 600
              to: 0
            model:
              refId: A
              expr: '100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)'
              intervalMs: 60000
          - refId: B
            datasourceUid: __expr__
            model:
              refId: B
              type: reduce
              expression: A
              reducer: last
          - refId: C
            datasourceUid: __expr__
            model:
              refId: C
              type: threshold
              expression: B
              conditions:
                - evaluator:
                    type: gt
                    params: [80]
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "CPU usage is above 80%"
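
The rule chains three steps (query A, reduce B, threshold C) and then applies the `for` duration. The sketch below models that flow for a single series. It is a simplified illustration with hypothetical function names, and it ignores the NoData and Error states that Grafana also tracks.

```python
def reduce_last(series):
    """Step B: collapse the queried series to its most recent value."""
    return series[-1]


def exceeds(value, threshold=80.0):
    """Step C: the 'gt' evaluator."""
    return value > threshold


def evaluate(samples_per_tick, interval_s=60, for_s=300, threshold=80.0):
    """Return the rule state after each evaluation tick."""
    states = []
    breached_for = None  # seconds the condition has been continuously true
    for series in samples_per_tick:
        if exceeds(reduce_last(series), threshold):
            breached_for = 0 if breached_for is None else breached_for + interval_s
            states.append("Alerting" if breached_for >= for_s else "Pending")
        else:
            breached_for = None
            states.append("Normal")
    return states


if __name__ == "__main__":
    ticks = [[50, 60]] + [[85, 90]] * 6 + [[40]]
    print(evaluate(ticks))
```

With a 1m evaluation interval and `for: 5m`, the rule stays Pending until the condition has held for five minutes, which filters out short CPU spikes.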

Notification Channels (Contact Points)

# grafana/provisioning/alerting/contact_points.yml
apiVersion: 1

contactPoints:
  - orgId: 1
    name: slack-alerts
    receivers:
      - uid: slack-1
        type: slack
        settings:
          url: https://hooks.slack.com/services/xxx/yyy/zzz
          recipient: "#alerts"
          title: '{{ template "default.title" . }}'
          text: '{{ template "default.message" . }}'

  - orgId: 1
    name: email-alerts
    receivers:
      - uid: email-1
        type: email
        settings:
          addresses: team@example.com

16.8 Dashboard Best Practices

Panel Organization

Dashboard: Node Overview
├── Row: Overview
│   ├── CPU Usage (timeseries)
│   ├── Memory Usage (gauge)
│   ├── Disk Usage (gauge)
│   └── Network I/O (timeseries)
├── Row: CPU Details
│   ├── CPU Usage by Mode (timeseries)
│   └── Load Average (timeseries)
├── Row: Memory Details
│   ├── Memory Usage Breakdown (timeseries)
│   └── Swap Usage (timeseries)
└── Row: Disk Details
    ├── Disk I/O (timeseries)
    └── Disk Space by Mountpoint (timeseries)

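Unit Conventions

Metric type	Grafana unit
CPU usage	`percent (0-100)`
Memory	`bytes` (IEC, base 1024) / `decbytes` (SI, base 1000)
Network traffic	`Bps` (bytes/s) / `bps` (bits/s)
Request rate	`reqps`
Latency	`s` / `ms`
Disk IOPS	`iops`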
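
The difference between the `bytes` and `decbytes` units is the scaling base. A small sketch of the arithmetic follows; `format_bytes` is a hypothetical helper, and Grafana's own rounding and labels may differ in detail.

```python
def format_bytes(value: float, base: int) -> str:
    """Scale a byte count using base 1024 (IEC) or base 1000 (SI)."""
    if base not in (1000, 1024):
        raise ValueError("base must be 1000 or 1024")
    units = ["B", "KiB", "MiB", "GiB", "TiB"] if base == 1024 else ["B", "kB", "MB", "GB", "TB"]
    index = 0
    while abs(value) >= base and index < len(units) - 1:
        value /= base
        index += 1
    return f"{value:.2f} {units[index]}"


if __name__ == "__main__":
    total = 8_000_000_000
    print(format_bytes(total, 1024))  # IEC scaling, like the `bytes` unit
    print(format_bytes(total, 1000))  # SI scaling, like the `decbytes` unit
```

Memory is conventionally shown in IEC units, while disk vendors and network tools often use SI, so picking the matching unit avoids apparent discrepancies of several percent.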

Threshold Settings

CPU usage:
  < 70%  → green
  70-85% → yellow
  > 85%  → red

Disk usage:
  < 70%  → green
  70-85% → yellow
  > 85%  → red

Error rate:
  < 1%   → green
  1-5%   → yellow
  > 5%   → red
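
These ranges map onto Grafana threshold steps, where each step applies from its value upward and the base step has no value. The sketch below shows that lookup; `threshold_color` and the step lists are illustrative names. Note that under this rule a value of exactly 85 already falls into the red step.

```python
def threshold_color(value, steps):
    """Return the color of the last step whose lower bound is <= value.

    steps: list of (lower_bound_or_None, color) in ascending order;
    the first entry uses None as the base step.
    """
    color = steps[0][1]
    for bound, step_color in steps:
        if bound is None or value >= bound:
            color = step_color
    return color


CPU_STEPS = [(None, "green"), (70, "yellow"), (85, "red")]
ERROR_RATE_STEPS = [(None, "green"), (1, "yellow"), (5, "red")]

if __name__ == "__main__":
    for cpu in (42, 70, 84.9, 85, 99):
        print(cpu, threshold_color(cpu, CPU_STEPS))
```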

16.9 Chapter Summary

Feature	Configuration method
Data source	Provisioning YAML
Dashboard	JSON provisioning / Web UI
Variables	`label_values()` queries
Alerting	Unified Alerting
Import	Import by ID / import JSON

