16 - Grafana Integration

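16.1 Overview

Grafana is one of the most widely used open-source data visualization platforms, and it ships with a built-in Prometheus data source. This chapter covers how to configure the Prometheus data source, how to create dashboards, and how to use Grafana alerting.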

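Grafana core capabilities

Feature            Description
-----------------  ------------------------------------------------------
Data sources       Prometheus, Elasticsearch, MySQL, InfluxDB, and others
Dashboards         A wide range of visualization panels
Alerting           Unified alerting (Grafana 8+)
Variables          Parameterize dashboards dynamically
Annotations        Mark events on graphs
Dashboard sharing  Export/import as JSON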

16.2 Installing Grafana

Docker

docker run -d \
  --name=grafana \
  -p 3000:3000 \
  -v grafana_data:/var/lib/grafana \
  -e "GF_SECURITY_ADMIN_USER=admin" \
  -e "GF_SECURITY_ADMIN_PASSWORD=admin" \
  grafana/grafana:10.4.0

Docker Compose

services:
  grafana:
    image: grafana/grafana:10.4.0
    container_name: grafana
    restart: unless-stopped
    ports:
      - "3000:3000"
    environment:
      - GF_SECURITY_ADMIN_USER=admin
      - GF_SECURITY_ADMIN_PASSWORD=admin123
      - GF_USERS_ALLOW_SIGN_UP=false
      - GF_SERVER_ROOT_URL=https://grafana.example.com
    volumes:
      - grafana_data:/var/lib/grafana
      - ./grafana/provisioning:/etc/grafana/provisioning:ro

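Default access

  • URL: http://localhost:3000
  • Username: admin
  • Password: admin (Grafana prompts for a new password on first login), or the value set in GF_SECURITY_ADMIN_PASSWORD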
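
Once the container is running, a quick check that Grafana is reachable can be scripted against its /api/health endpoint. The following is a minimal sketch, assuming Grafana listens on http://localhost:3000; the function names are illustrative and not part of any Grafana client library.

```python
import json
import urllib.request


def health_url(base_url: str) -> str:
    """Build the URL of Grafana's health endpoint from a base URL."""
    return base_url.rstrip("/") + "/api/health"


def is_healthy(body: str) -> bool:
    """Return True when the health response reports a working database."""
    return json.loads(body).get("database") == "ok"


def check_grafana(base_url: str = "http://localhost:3000", timeout: float = 5.0) -> bool:
    """Query the health endpoint; this call needs a running Grafana instance."""
    with urllib.request.urlopen(health_url(base_url), timeout=timeout) as resp:
        return is_healthy(resp.read().decode("utf-8"))
```

Calling check_grafana() returns True when the instance answers and reports its database as "ok"; it raises an exception from urllib if the server cannot be reached.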

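16.3 Configuring the Prometheus Data Source

Via the Web UI

  1. Log in to Grafana
  2. Left-hand menu → Connections → Data sources → Add data source (Configuration → Data Sources in Grafana 9 and earlier)
  3. Select Prometheus
  4. Enter the URL: http://prometheus:9090 (the hostname must be resolvable from the Grafana server, for example on a shared Docker network)
  5. Click Save & Test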
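
The same data source can also be created through Grafana's HTTP API (POST /api/datasources), which is handy for one-off scripting. This is a sketch under assumptions: it uses basic authentication with the admin/admin credentials from the Docker example, and the helper names are made up for illustration.

```python
import base64
import json
import urllib.request


def datasource_payload(name: str, url: str, is_default: bool = True) -> dict:
    """Build the JSON body describing a Prometheus data source."""
    return {
        "name": name,
        "type": "prometheus",
        "access": "proxy",
        "url": url,
        "isDefault": is_default,
        "jsonData": {"httpMethod": "POST"},
    }


def build_request(grafana_url: str, user: str, password: str, payload: dict) -> urllib.request.Request:
    """Prepare (but do not send) the POST request to /api/datasources."""
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    return urllib.request.Request(
        grafana_url.rstrip("/") + "/api/datasources",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json", "Authorization": f"Basic {token}"},
        method="POST",
    )


request = build_request(
    "http://localhost:3000", "admin", "admin",
    datasource_payload("Prometheus", "http://prometheus:9090"),
)
# To send it against a running Grafana instance:
# urllib.request.urlopen(request)
```

The request is built separately from sending so that the payload can be inspected first; in production a service account token is preferable to the admin password.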

Via provisioning (recommended)

# grafana/provisioning/datasources/prometheus.yml
apiVersion: 1

datasources:
  - name: Prometheus
    uid: prometheus  # stable UID so dashboards and alert rules can reference this data source
    type: prometheus
    access: proxy
    url: http://prometheus:9090
    isDefault: true
    editable: true
    jsonData:
      httpMethod: POST
      timeInterval: '15s'
      queryTimeout: '60s'

  - name: Prometheus-Thanos
    type: prometheus
    access: proxy
    url: http://thanos-query:9090
    editable: true
    jsonData:
      httpMethod: POST

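Data source configuration parameters

Parameter     Description
------------  ----------------------------------------------------------------------
url           Address of the Prometheus server
access        proxy (requests go through the Grafana backend) or direct (the browser connects directly; deprecated for Prometheus in recent Grafana versions)
httpMethod    GET or POST (POST is recommended because it avoids overly long URLs)
timeInterval  Should match scrape_interval; used as the lower bound when computing $__interval
queryTimeout  Query timeout
maxLines      Maximum number of lines for log data sources (such as Loki); not used by Prometheus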
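
To see why timeInterval matters, it helps to approximate how the query step is chosen. The sketch below is a simplified model, not Grafana's actual implementation: it assumes the step is roughly the time range divided by the panel's maximum number of data points, floored at timeInterval, and it ignores the rounding to "nice" values that Grafana applies.

```python
def approx_interval_seconds(range_seconds: float, max_data_points: int,
                            min_interval_seconds: float) -> float:
    """Approximate $__interval: range / max data points, never below the minimum interval."""
    raw = range_seconds / max_data_points
    return max(raw, min_interval_seconds)


# One hour on a panel with 1000 data points: the raw step (3.6s) is below
# the 15s scrape interval, so the minimum interval is used instead.
one_hour = approx_interval_seconds(3600, 1000, 15)

# Seven days on the same panel: the raw step dominates.
one_week = approx_interval_seconds(7 * 24 * 3600, 1000, 15)
```

Without the floor, short time ranges would request steps smaller than the scrape interval, and functions such as rate() would see windows containing too few samples.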

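16.4 Creating Dashboards

Creating a panel manually

  1. Left-hand menu → Dashboards → New Dashboard
  2. Add a panel → Add visualization
  3. Select the data source
  4. Enter a PromQL query
  5. Configure the visualization options
  6. Save

Automatic import via provisioning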

# grafana/provisioning/dashboards/dashboards.yml
apiVersion: 1

providers:
  - name: 'Custom'
    orgId: 1
    folder: 'Custom'
    type: file
    disableDeletion: false
    editable: true
    allowUiUpdates: true
    options:
      path: /etc/grafana/provisioning/dashboards/json
      # Set this to true only when `folder` above is left empty; the two options cannot be combined
      foldersFromFilesStructure: false

Dashboard file: grafana/provisioning/dashboards/json/node-exporter.json

File-provisioned dashboards hold the dashboard model at the top level of the file. The {"dashboard": {...}} wrapper is used only by the HTTP API and must not appear here.

{
  "id": null,
  "uid": "node-exporter",
  "title": "Node Exporter",
  "tags": ["node", "linux"],
  "timezone": "browser",
  "refresh": "30s",
  "panels": [
    {
      "id": 1,
      "title": "CPU Usage",
      "type": "timeseries",
      "datasource": "Prometheus",
      "gridPos": { "h": 8, "w": 12, "x": 0, "y": 0 },
      "targets": [
        {
          "refId": "A",
          "expr": "100 - (avg by(instance) (rate(node_cpu_seconds_total{mode=\"idle\"}[5m])) * 100)",
          "legendFormat": "{{ instance }}"
        }
      ],
      "fieldConfig": {
        "defaults": {
          "unit": "percent",
          "min": 0,
          "max": 100,
          "thresholds": {
            "mode": "absolute",
            "steps": [
              { "value": null, "color": "green" },
              { "value": 80, "color": "yellow" },
              { "value": 90, "color": "red" }
            ]
          }
        },
        "overrides": []
      }
    }
  ],
  "time": {
    "from": "now-1h",
    "to": "now"
  }
}
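
When many dashboards share the same structure, the JSON file can be generated instead of written by hand. The following is a minimal sketch, assuming a data source named "Prometheus"; the helper functions are hypothetical and only cover a small subset of the dashboard model. It writes the model at the top level of the file, which is the layout file provisioning reads.

```python
import json


def timeseries_panel(panel_id: int, title: str, expr: str, unit: str = "percent") -> dict:
    """Build a minimal timeseries panel with a single PromQL target."""
    return {
        "id": panel_id,
        "title": title,
        "type": "timeseries",
        "datasource": "Prometheus",
        "gridPos": {"h": 8, "w": 12, "x": 0, "y": 0},
        "targets": [{"refId": "A", "expr": expr, "legendFormat": "{{ instance }}"}],
        "fieldConfig": {"defaults": {"unit": unit}, "overrides": []},
    }


def dashboard(uid: str, title: str, panels: list) -> dict:
    """Build the top-level dashboard model (no API wrapper) for file provisioning."""
    return {
        "id": None,
        "uid": uid,
        "title": title,
        "refresh": "30s",
        "time": {"from": "now-1h", "to": "now"},
        "panels": panels,
    }


def write_dashboard(path: str, model: dict) -> None:
    """Serialize the model to a JSON file that a file provider can pick up."""
    with open(path, "w", encoding="utf-8") as fh:
        json.dump(model, fh, indent=2)


CPU_EXPR = '100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)'
model = dashboard("node-exporter", "Node Exporter", [timeseries_panel(1, "CPU Usage", CPU_EXPR)])
# write_dashboard("grafana/provisioning/dashboards/json/node-exporter.json", model)
```

Keeping the expression in a Python string avoids the manual escaping of double quotes that hand-written JSON requires.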

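16.5 Commonly Used Dashboards

Recommended dashboards

ID     Name                  Purpose
-----  --------------------  --------------------------------
1860   Node Exporter Full    Comprehensive system metrics
7362   MySQL Overview        MySQL database monitoring
11835  Redis Dashboard       Redis cache monitoring
9628   Docker Monitoring     Docker container monitoring
6417   Kubernetes Cluster    Kubernetes cluster monitoring
15661  Prometheus 2.0 Stats  Monitoring Prometheus itself

The IDs refer to dashboards published on grafana.com. Check that each ID still matches the listed name there before importing it.

Importing a dashboard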

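# Download the dashboard JSON from grafana.com; the import API expects the full model, not just an ID
curl -s https://grafana.com/api/dashboards/1860/revisions/latest/download -o dashboard.json

# Wrap the model in an import request and send it to Grafana (requires jq).
# The input names must match the __inputs declared in the downloaded JSON.
jq '{dashboard: ., overwrite: true,
     inputs: [{name: "DS_PROMETHEUS", type: "datasource", pluginId: "prometheus", value: "Prometheus"}]}' \
  dashboard.json |
curl -X POST -u admin:admin http://localhost:3000/api/dashboards/import \
  -H "Content-Type: application/json" \
  -d @-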

Importing via the Web UI

  1. Left-hand menu → Dashboards → New → Import
  2. Enter the dashboard ID (for example 1860)
  3. Select the Prometheus data source
  4. Click Import

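16.6 Variables

Variables make a dashboard dynamic and interactive: a single dashboard can be switched between jobs, instances, or environments from drop-down menus.

Creating a variable

  1. Dashboard settings → Variables → Add variable
  2. Configure the variable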

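Variable types

Type         Description                      Example
-----------  -------------------------------  ----------------------
Query        Values queried from Prometheus   label_values(up, job)
Custom       Manually defined list of values  production,staging,dev
Interval     List of time intervals           1m,5m,15m,1h
Data source  Selects a data source by type    prometheus

Common variable queries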

# All job names
label_values(up, job)

# Instances of the selected job
label_values(up{job="$job"}, instance)

# All cluster names
label_values(up, cluster)

# All status codes
label_values(http_requests_total, status)

# All namespaces
label_values(kube_pod_info, namespace)

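Variable references

# Use a single-value variable in a query
rate(http_requests_total{job="$job"}[5m])

# Multi-value variables need the regex matcher =~
rate(http_requests_total{job=~"$job"}[5m])

# Use an Interval-type variable named interval
rate(http_requests_total[$interval])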
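
The reason multi-value variables need =~ becomes clear when looking at what the query turns into after substitution. The sketch below is a simplified model of that substitution, not Grafana's real implementation: it assumes selected values are regex-escaped and joined with "|", and the escaping rules shown are an approximation.

```python
import re

# Characters treated as regex metacharacters in this simplified model.
_SPECIAL = re.compile(r"[$^*{}\[\]'+?.()|]")


def escape_value(value: str) -> str:
    """Escape regex metacharacters; backslashes are doubled for the PromQL string literal."""
    value = value.replace("\\", "\\\\\\\\")
    return _SPECIAL.sub(lambda m: "\\\\" + m.group(0), value)


def interpolate(values: list) -> str:
    """Turn the selected values into one regex: a single value, or (a|b|c)."""
    escaped = [escape_value(v) for v in values]
    if len(escaped) == 1:
        return escaped[0]
    return "(" + "|".join(escaped) + ")"


def render(query: str, variables: dict) -> str:
    """Replace each $name in the query with its interpolated value."""
    for name, values in variables.items():
        pattern = re.compile(r"\$" + re.escape(name) + r"\b")
        replacement = interpolate(values)
        query = pattern.sub(lambda m: replacement, query)
    return query


rendered = render('rate(http_requests_total{job=~"$job"}[5m])', {"job": ["api", "web"]})
```

With two jobs selected, the rendered matcher is a regular expression alternation, which the equality matcher = would compare literally and never match.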

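16.7 Grafana Alerting

Grafana 8 introduced Unified Alerting, which lets you define and manage alert rules directly in Grafana.

Configuring an alert rule

  1. Left-hand menu → Alerting → Alert rules
  2. Click New alert rule
  3. Configure the query and the condition
  4. Set the evaluation interval and the notification settings

Alert rule example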

# Alert rules configured through provisioning
# grafana/provisioning/alerting/rules.yml
apiVersion: 1

groups:
  - orgId: 1
    name: Infrastructure
    folder: Alerts
    interval: 1m
    rules:
      - uid: cpu-high
        title: High CPU Usage
        condition: C
        data:
          # A: the PromQL query
          - refId: A
            # Must match the uid of the Prometheus data source
            datasourceUid: prometheus
            # Query the last 10 minutes (values in seconds)
            relativeTimeRange:
              from: 600
              to: 0
            model:
              refId: A
              expr: '100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)'
              intervalMs: 60000
          # B: reduce each series to its last value
          - refId: B
            datasourceUid: __expr__
            model:
              refId: B
              type: reduce
              expression: A
              reducer: last
          # C: fire when the reduced value is greater than 80
          - refId: C
            datasourceUid: __expr__
            model:
              refId: C
              type: threshold
              expression: B
              conditions:
                - evaluator:
                    type: gt
                    params: [80]
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "CPU usage is above 80%"
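
The interplay between the reduce step, the threshold, and the for: 5m setting can be illustrated with a small simulation. This is a sketch of the general pending/alerting behaviour, assuming one evaluation per minute; it is not Grafana's scheduler, and the function names are illustrative.

```python
def reduce_last(points: list) -> float:
    """Mimic the 'last' reducer: keep only the most recent point of a series."""
    return points[-1]


def evaluate(values: list, threshold: float = 80.0,
             interval_s: int = 60, for_s: int = 300) -> list:
    """Return the rule state after each evaluation of the reduced values."""
    states = []
    breaching_since = None  # time at which the condition first became true
    for i, value in enumerate(values):
        now = i * interval_s
        if value > threshold:
            if breaching_since is None:
                breaching_since = now
            # The rule fires only after the condition has held for the full `for` duration.
            states.append("Alerting" if now - breaching_since >= for_s else "Pending")
        else:
            breaching_since = None
            states.append("Normal")
    return states


# One reduced CPU value per one-minute evaluation.
states = evaluate([50, 85, 90, 95, 92, 91, 88, 60])
```

In this run the rule stays Pending for five evaluations, switches to Alerting once the value has been above 80 for a full five minutes, and returns to Normal as soon as the value drops.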

Notification channels (contact points)

# grafana/provisioning/alerting/contact_points.yml
apiVersion: 1

contactPoints:
  - orgId: 1
    name: slack-alerts
    receivers:
      - uid: slack-1
        type: slack
        settings:
          url: https://hooks.slack.com/services/xxx/yyy/zzz
          recipient: "#alerts"
          title: '{{ template "default.title" . }}'
          text: '{{ template "default.message" . }}'

  - orgId: 1
    name: email-alerts
    receivers:
      - uid: email-1
        type: email
        settings:
          addresses: team@example.com

16.8 Dashboard Best Practices

Panel organization

Dashboard: Node Overview
├── Row: Overview
│   ├── CPU Usage (timeseries)
│   ├── Memory Usage (gauge)
│   ├── Disk Usage (gauge)
│   └── Network I/O (timeseries)
├── Row: CPU Details
│   ├── CPU Usage by Mode (timeseries)
│   └── Load Average (timeseries)
├── Row: Memory Details
│   ├── Memory Usage Breakdown (timeseries)
│   └── Swap Usage (timeseries)
└── Row: Disk Details
    ├── Disk I/O (timeseries)
    └── Disk Space by Mountpoint (timeseries)

Unit conventions

Metric type      Grafana unit
---------------  ----------------
CPU usage        percent (0-100)
Memory           bytes / decbytes
Network traffic  Bps / bps
Request rate     reqps
Latency          s / ms
Disk IOPS        iops

Threshold settings

CPU usage:
  < 70%  → green
  70-85% → yellow
  > 85%  → red

Disk usage:
  < 70%  → green
  70-85% → yellow
  > 85%  → red

Error rate:
  < 1%   → green
  1-5%   → yellow
  > 5%   → red
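
Threshold steps like these map a value to the colour of the highest step whose lower bound it has reached. The following sketch shows that lookup, assuming the same step layout as a panel's thresholds (a base colour followed by ascending lower bounds); the function and constant names are illustrative.

```python
# Each step is (lower_bound, color); the first step is the base color with no bound.
CPU_STEPS = [(None, "green"), (70, "yellow"), (85, "red")]
ERROR_RATE_STEPS = [(None, "green"), (1, "yellow"), (5, "red")]


def color_for(value: float, steps: list) -> str:
    """Return the color of the highest step whose lower bound is <= value."""
    color = steps[0][1]
    for bound, step_color in steps[1:]:
        if value >= bound:
            color = step_color
    return color
```

Because a step applies from its lower bound upward, a value of exactly 85 is already shown in red in this model.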

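16.9 Chapter Summary

Feature       How to configure
------------  --------------------------
Data sources  Provisioning YAML
Dashboards    JSON provisioning / Web UI
Variables     label_values() queries
Alerting      Unified Alerting
Import        By ID / by JSON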
