16 - Grafana Integration

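16.1 Overview

Grafana is one of the most widely used open-source data visualization platforms and supports Prometheus as a built-in data source. This chapter covers configuring the Prometheus data source, building dashboards, and using Grafana alerting.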

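Grafana core capabilities

Feature            Description
Data sources       Prometheus, Elasticsearch, MySQL, InfluxDB, and others
Dashboards         A rich set of visualization panels
Alerting           Unified alerting (Grafana 8+)
Variables          Dynamic, parameterized dashboards
Annotations        Mark events on graphs
Dashboard sharing  Export/import as JSON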

16.2 Installing Grafana

Docker

docker run -d \
  --name=grafana \
  -p 3000:3000 \
  -v grafana_data:/var/lib/grafana \
  -e "GF_SECURITY_ADMIN_USER=admin" \
  -e "GF_SECURITY_ADMIN_PASSWORD=admin" \
  grafana/grafana:10.4.0

Docker Compose

services:
  grafana:
    image: grafana/grafana:10.4.0
    container_name: grafana
    restart: unless-stopped
    ports:
      - "3000:3000"
    environment:
      - GF_SECURITY_ADMIN_USER=admin
      - GF_SECURITY_ADMIN_PASSWORD=admin123
      - GF_USERS_ALLOW_SIGN_UP=false
      - GF_SERVER_ROOT_URL=https://grafana.example.com
    volumes:
      - grafana_data:/var/lib/grafana
      - ./grafana/provisioning:/etc/grafana/provisioning:ro

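Default access

  • URL: http://localhost:3000
  • Username: admin
  • Password: the value of GF_SECURITY_ADMIN_PASSWORD (admin in the Docker example, admin123 in the Compose example). With the default password admin, Grafana prompts you to change it on first login.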

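16.3 Configuring the Prometheus Data Source

Via the Web UI

  1. Log in to Grafana
  2. Left menu → Connections → Data sources → Add new data source (Configuration → Data Sources → Add data source in Grafana 9 and earlier)
  3. Select Prometheus
  4. Enter the URL: http://prometheus:9090
  5. Click Save & test

Via provisioning (recommended)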

# grafana/provisioning/datasources/prometheus.yml
apiVersion: 1

datasources:
  - name: Prometheus
    uid: prometheus  # fixed uid so provisioned alert rules can reference it as datasourceUid
    type: prometheus
    access: proxy
    url: http://prometheus:9090
    isDefault: true
    editable: true
    jsonData:
      httpMethod: POST
      timeInterval: '15s'
      queryTimeout: '60s'

  - name: Prometheus-Thanos
    type: prometheus
    access: proxy
    url: http://thanos-query:9090
    editable: true
    jsonData:
      httpMethod: POST

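Data source parameters

Parameter     Description
url           Prometheus address
access        proxy (requests go through the Grafana backend) or direct (the browser connects directly; deprecated in recent Grafana versions)
httpMethod    GET or POST (POST is recommended, to avoid overly long URLs)
timeInterval  Should match scrape_interval; acts as the lower bound when $__interval is computed
queryTimeout  Query timeout
maxLines      Maximum number of lines for log data sources (for example Loki; not used by Prometheus)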
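
To make the role of timeInterval more concrete, here is a minimal Python sketch of how $__interval can be approximated: the dashboard time range divided by the number of data points, never going below timeInterval. This is a simplification under stated assumptions (Grafana also rounds the result to convenient step sizes, which is omitted here), and the names parse_duration and approx_interval are hypothetical helpers, not Grafana APIs.

```python
import re

# Seconds per supported unit suffix.
_UNITS = {"ms": 0.001, "s": 1, "m": 60, "h": 3600, "d": 86400}


def parse_duration(text: str) -> float:
    """Convert a duration such as '15s' or '1m' into seconds."""
    match = re.fullmatch(r"(\d+)(ms|s|m|h|d)", text.strip())
    if match is None:
        raise ValueError(f"unsupported duration: {text!r}")
    return int(match.group(1)) * _UNITS[match.group(2)]


def approx_interval(range_seconds: float, max_data_points: int,
                    time_interval: str) -> float:
    """Approximate $__interval: range / points, bounded below by timeInterval."""
    raw = range_seconds / max_data_points
    return max(raw, parse_duration(time_interval))


if __name__ == "__main__":
    # A 1-hour range on a 1000-point panel would ask for 3.6s steps,
    # which is finer than the scrape interval, so the bound applies.
    print(approx_interval(3600, 1000, "15s"))
    # A 7-day range yields a step well above the bound.
    print(approx_interval(7 * 86400, 1000, "15s"))
```

Setting timeInterval to the scrape interval therefore keeps short time ranges from requesting steps finer than the data that Prometheus actually holds.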

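16.4 Creating Dashboards

Creating a panel manually

  1. Left menu → Dashboards → New Dashboard
  2. Add a panel → Add visualization
  3. Select the data source
  4. Enter a PromQL query
  5. Configure the visualization options
  6. Save

Automatic import via provisioning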

# grafana/provisioning/dashboards/dashboards.yml
apiVersion: 1

providers:
  - name: 'Custom'
    orgId: 1
    folder: 'Custom'
    type: file
    disableDeletion: false
    editable: true
    allowUiUpdates: true
    options:
      path: /etc/grafana/provisioning/dashboards/json
      foldersFromFilesStructure: false  # keep false while `folder` is set; the two options conflict

Dashboard file: grafana/provisioning/dashboards/json/node-exporter.json
(A file-provisioned dashboard contains the dashboard model itself at the top level; the "dashboard" wrapper is only used in HTTP API request bodies.)

{
  "id": null,
  "uid": "node-exporter",
  "title": "Node Exporter",
  "tags": ["node", "linux"],
  "timezone": "browser",
  "refresh": "30s",
  "panels": [
    {
      "id": 1,
      "title": "CPU Usage",
      "type": "timeseries",
      "datasource": "Prometheus",
      "targets": [
        {
          "expr": "100 - (avg by(instance) (rate(node_cpu_seconds_total{mode=\"idle\"}[5m])) * 100)",
          "legendFormat": "{{ instance }}"
        }
      ],
      "fieldConfig": {
        "defaults": {
          "unit": "percent",
          "min": 0,
          "max": 100,
          "thresholds": {
            "mode": "absolute",
            "steps": [
              { "value": null, "color": "green" },
              { "value": 80, "color": "yellow" },
              { "value": 90, "color": "red" }
            ]
          }
        }
      }
    }
  ],
  "time": {
    "from": "now-1h",
    "to": "now"
  }
}
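
When several dashboards share the same panel layout, generating the JSON can be less error-prone than editing it by hand. The following is a minimal sketch, assuming file provisioning reads the dashboard model at the top level of the file; cpu_panel, build_dashboard and write_dashboard are hypothetical helper names, and the field values mirror the example above.

```python
import json


def cpu_panel(panel_id: int, datasource: str,
              warn: float = 80, crit: float = 90) -> dict:
    """Return a timeseries panel showing CPU usage per instance."""
    return {
        "id": panel_id,
        "title": "CPU Usage",
        "type": "timeseries",
        "datasource": datasource,
        "targets": [{
            "expr": '100 - (avg by(instance) '
                    '(rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)',
            "legendFormat": "{{ instance }}",
        }],
        "fieldConfig": {"defaults": {
            "unit": "percent", "min": 0, "max": 100,
            "thresholds": {"mode": "absolute", "steps": [
                {"value": None, "color": "green"},   # base step, serialized as null
                {"value": warn, "color": "yellow"},
                {"value": crit, "color": "red"},
            ]},
        }},
    }


def build_dashboard(uid: str, title: str, panels: list) -> dict:
    """Assemble a dashboard model with the panels given."""
    return {
        "id": None,
        "uid": uid,
        "title": title,
        "tags": ["node", "linux"],
        "timezone": "browser",
        "refresh": "30s",
        "panels": panels,
        "time": {"from": "now-1h", "to": "now"},
    }


def write_dashboard(dashboard: dict, path: str) -> None:
    """Write the dashboard model as indented JSON to the given path."""
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(dashboard, handle, indent=2)


if __name__ == "__main__":
    model = build_dashboard("node-exporter", "Node Exporter",
                            [cpu_panel(1, "Prometheus")])
    write_dashboard(model, "node-exporter.json")
```

The output file would then be placed in the directory named by options.path in the provider configuration.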

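16.5 Commonly Used Dashboards

Recommended dashboards

ID     Name                  Purpose
1860   Node Exporter Full    Comprehensive system metrics
7362   MySQL Overview        MySQL database monitoring
11835  Redis Dashboard       Redis cache monitoring
9628   Docker Monitoring     Docker container monitoring
6417   Kubernetes Cluster    Kubernetes cluster monitoring
15661  Prometheus 2.0 Stats  Monitoring Prometheus itself

These are community dashboard IDs from grafana.com; confirm that each ID still matches the dashboard named here before importing.

Importing a dashboard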

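# Import through the Grafana API (requires jq).
# The import endpoint expects the full dashboard JSON, so download it from grafana.com first.
curl -sL https://grafana.com/api/dashboards/1860/revisions/latest/download \
  -o node-exporter-full.json

# Wrap the downloaded model in the import request body and post it to Grafana.
jq '{dashboard: ., overwrite: true,
     inputs: [{name: "DS_PROMETHEUS", type: "datasource", pluginId: "prometheus", value: "Prometheus"}]}' \
  node-exporter-full.json |
curl -X POST http://admin:admin@localhost:3000/api/dashboards/import \
  -H "Content-Type: application/json" \
  -d @-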
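
The same import can be scripted without shell tools. The sketch below assumes the dashboard JSON has already been downloaded and parsed, and that the request body has the shape used in the curl example (dashboard, overwrite, inputs). build_import_payload and import_dashboard are hypothetical helper names; only the payload builder is exercised without a running Grafana instance.

```python
import base64
import json
import urllib.request


def build_import_payload(dashboard: dict, datasource_name: str,
                         input_name: str = "DS_PROMETHEUS") -> dict:
    """Wrap a dashboard model in the body sent to /api/dashboards/import."""
    return {
        "dashboard": dashboard,
        "overwrite": True,
        "inputs": [{
            "name": input_name,          # placeholder name used inside the dashboard
            "type": "datasource",
            "pluginId": "prometheus",
            "value": datasource_name,    # name of the data source in Grafana
        }],
    }


def import_dashboard(base_url: str, payload: dict,
                     user: str, password: str) -> dict:
    """POST the payload with basic auth and return the decoded JSON reply."""
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    request = urllib.request.Request(
        f"{base_url}/api/dashboards/import",
        data=json.dumps(payload).encode(),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Basic {token}",
        },
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        return json.load(response)


if __name__ == "__main__":
    # Assumes node-exporter-full.json was downloaded beforehand.
    with open("node-exporter-full.json", encoding="utf-8") as handle:
        model = json.load(handle)
    body = build_import_payload(model, "Prometheus")
    print(import_dashboard("http://localhost:3000", body, "admin", "admin"))
```

In anything beyond a local test, a service account token would be a safer choice than the admin password used here.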

Importing via the Web UI

  1. Left menu → Dashboards → Import
  2. Enter the dashboard ID (for example 1860)
  3. Select the Prometheus data source
  4. Click Import

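16.6 Variables

Variables make a dashboard dynamic and interactive.

Creating a variable

  1. Dashboard settings → Variables → Add variable
  2. Configure the variable

Variable types

Type        Description                     Example
Query       Values queried from Prometheus  label_values(up, job)
Custom      User-defined list of values     production,staging,dev
Interval    Selectable time intervals       1m,5m,15m,1h (optionally with the Auto option)
Datasource  Data source selector            prometheus

Common variable queries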

# All job names
label_values(up, job)

# Instances of a specific job
label_values(up{job="$job"}, instance)

# All cluster names
label_values(up, cluster)

# All status codes
label_values(http_requests_total, status)

# All namespaces
label_values(kube_pod_info, namespace)

Referencing variables

# Use a variable in a query
rate(http_requests_total{job="$job"}[5m])

# Multi-value variable (requires the regex matcher =~)
rate(http_requests_total{job=~"$job"}[5m])

# Use the $interval variable
rate(http_requests_total[$interval])
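
The reason multi-value variables need =~ is that several selected values are expanded into a single regex alternation. The Python sketch below emulates that substitution in simplified form: a single string is inserted as-is, while a list becomes (a|b). It is an illustration only; Grafana's own escaping rules differ from re.escape, and format_value and interpolate are hypothetical names.

```python
import re


def format_value(value) -> str:
    """Format one variable value the way a regex matcher expects it."""
    if isinstance(value, str):
        return value
    escaped = [re.escape(item) for item in value]
    if len(escaped) == 1:
        return escaped[0]
    return "(" + "|".join(escaped) + ")"


def interpolate(query: str, variables: dict) -> str:
    """Replace $name references with formatted values; unknown names stay."""
    def substitute(match: re.Match) -> str:
        name = match.group(1)
        if name not in variables:
            return match.group(0)
        return format_value(variables[name])

    return re.sub(r"\$(\w+)", substitute, query)


if __name__ == "__main__":
    query = 'rate(http_requests_total{job=~"$job"}[$interval])'
    print(interpolate(query, {"job": ["api", "web"], "interval": "5m"}))
```

With job set to two values, the selector becomes job=~"(api|web)", which the exact-match operator = would never match against a real job label.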

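16.7 Grafana Alerting

Grafana 8 introduced a unified alerting system (Unified Alerting), so alert rules can be managed directly in Grafana.

Configuring an alert rule

  1. Left menu → Alerting → Alert rules
  2. New alert rule
  3. Configure the query and the condition
  4. Set the evaluation interval and notifications

Alert rule example

# Alert rule configured via provisioning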
# grafana/provisioning/alerting/rules.yml
apiVersion: 1

groups:
  - orgId: 1
    name: Infrastructure
    folder: Alerts
    interval: 1m
    rules:
      - uid: cpu-high
        title: High CPU Usage
        condition: C
        data:
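          - refId: A
            datasourceUid: prometheus  # must match the uid of the Prometheus data source
            relativeTimeRange:         # query window in seconds, relative to now
              from: 600
              to: 0
            model:
              expr: 100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
              intervalMs: 60000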
          - refId: B
            datasourceUid: __expr__
            model:
              type: reduce
              expression: A
              reducer: last
          - refId: C
            datasourceUid: __expr__
            model:
              type: threshold
              expression: B
              conditions:
                - evaluator:
                    type: gt
                    params: [80]
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "CPU usage is above 80%"
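
The rule above chains three steps: query A returns a series, expression B reduces it to its last value, and expression C compares that value with 80; the alert only fires once the condition has held for the `for` duration. The Python sketch below emulates that flow for a single series. It is a simplified model under stated assumptions (no NoData or Error handling, one series only), and AlertState, reduce_last and evaluate are hypothetical names rather than Grafana APIs.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class AlertState:
    status: str = "Normal"                  # Normal, Pending or Alerting
    breach_started: Optional[float] = None  # time the condition first held


def reduce_last(samples: Sequence[float]) -> float:
    """Step B: reduce the series to its most recent value."""
    return samples[-1]


def evaluate(state: AlertState, now: float, samples: Sequence[float],
             limit: float = 80.0, for_seconds: float = 300.0) -> AlertState:
    """Steps B and C plus the `for` clause, for one evaluation tick."""
    if not reduce_last(samples) > limit:
        return AlertState("Normal", None)
    started = now if state.breach_started is None else state.breach_started
    status = "Alerting" if now - started >= for_seconds else "Pending"
    return AlertState(status, started)


if __name__ == "__main__":
    state = AlertState()
    # Evaluate once per minute, matching `interval: 1m` in the rule group.
    for tick in range(7):
        state = evaluate(state, tick * 60.0, [75.0, 92.0])
        print(tick, state.status)
```

A short spike that clears before five minutes have passed moves the rule from Pending back to Normal without sending a notification.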

Contact points (notification channels)

# grafana/provisioning/alerting/contact_points.yml
apiVersion: 1

contactPoints:
  - orgId: 1
    name: slack-alerts
    receivers:
      - uid: slack-1
        type: slack
        settings:
          url: https://hooks.slack.com/services/xxx/yyy/zzz
          recipient: "#alerts"
          title: '{{ template "default.title" . }}'
          text: '{{ template "default.message" . }}'

  - orgId: 1
    name: email-alerts
    receivers:
      - uid: email-1
        type: email
        settings:
          addresses: team@example.com

16.8 Dashboard Best Practices

Panel organization

Dashboard: Node Overview
├── Row: Overview
│   ├── CPU Usage (timeseries)
│   ├── Memory Usage (gauge)
│   ├── Disk Usage (gauge)
│   └── Network I/O (timeseries)
├── Row: CPU Details
│   ├── CPU Usage by Mode (timeseries)
│   └── Load Average (timeseries)
├── Row: Memory Details
│   ├── Memory Usage Breakdown (timeseries)
│   └── Swap Usage (timeseries)
└── Row: Disk Details
    ├── Disk I/O (timeseries)
    └── Disk Space by Mountpoint (timeseries)

Unit conventions

Metric type      Grafana unit
CPU utilization  percent (0-100)
Memory           bytes / decbytes
Network traffic  Bps / bps
Request rate     reqps
Latency          s / ms
Disk IOPS        iops

Threshold settings

CPU utilization:
  < 70%  → green
  70-85% → yellow
  > 85%  → red

Disk utilization:
  < 70%  → green
  70-85% → yellow
  > 85%  → red

Error rate:
  < 1%   → green
  1-5%   → yellow
  > 5%   → red
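
These bands translate directly into the thresholds.steps list of a panel's field configuration. The following Python sketch builds the steps and looks up the color for a value, assuming absolute thresholds where a step applies once the value is greater than or equal to it (so a value exactly on a boundary takes the higher band's color). threshold_steps and color_for are hypothetical helper names.

```python
def threshold_steps(warn: float, crit: float) -> list:
    """Build green/yellow/red steps; the base step has no lower bound."""
    return [
        {"value": None, "color": "green"},
        {"value": warn, "color": "yellow"},
        {"value": crit, "color": "red"},
    ]


def color_for(value: float, steps: list) -> str:
    """Return the color of the highest step whose bound the value reaches."""
    color = steps[0]["color"]
    for step in steps[1:]:
        if value >= step["value"]:
            color = step["color"]
    return color


if __name__ == "__main__":
    cpu = threshold_steps(70, 85)
    error_rate = threshold_steps(1, 5)
    for sample in (42.0, 78.5, 93.0):
        print(sample, color_for(sample, cpu))
    print(0.4, color_for(0.4, error_rate))
```

Keeping the bands in one helper makes it easier to apply the same limits across every panel that shows the same kind of metric.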

16.9 Chapter Summary

Feature      How to configure
Data source  Provisioning YAML
Dashboard    JSON provisioning / Web UI
Variables    label_values() queries
Alerting     Unified Alerting
Import       By ID / by JSON
