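16 - Grafana Integration
16.1 Overview
Grafana is one of the most widely used open-source data visualization platforms and ships with built-in support for Prometheus as a data source. This chapter covers configuring the Prometheus data source, building dashboards, and using Grafana alerting.
Core Grafana capabilities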
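| Feature | Description |
|---|---|
| Data source support | Prometheus, Elasticsearch, MySQL, InfluxDB, and others |
| Dashboards | A wide range of visualization panels |
| Alerting | Unified alerting (Grafana 8+) |
| Variables | Parameterize dashboards dynamically |
| Annotations | Mark events on graphs |
| Dashboard sharing | Export and import as JSON |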
16.2 Installing Grafana
Docker
docker run -d \
  --name=grafana \
  -p 3000:3000 \
  -v grafana_data:/var/lib/grafana \
  -e "GF_SECURITY_ADMIN_USER=admin" \
  -e "GF_SECURITY_ADMIN_PASSWORD=admin" \
  grafana/grafana:10.4.0
Docker Compose
services:
  grafana:
    image: grafana/grafana:10.4.0
    container_name: grafana
    restart: unless-stopped
    ports:
      - "3000:3000"
    environment:
      - GF_SECURITY_ADMIN_USER=admin
      # Example password only; use a strong secret outside of local testing
      - GF_SECURITY_ADMIN_PASSWORD=admin123
      - GF_USERS_ALLOW_SIGN_UP=false
      - GF_SERVER_ROOT_URL=https://grafana.example.com
    volumes:
      - grafana_data:/var/lib/grafana
      - ./grafana/provisioning:/etc/grafana/provisioning:ro
# Named volumes must be declared at the top level
volumes:
  grafana_data:
Default access
- URL: http://localhost:3000
- Username: admin
- Password: admin (Grafana prompts you to change it at first login)
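16.3 Configuring the Prometheus Data Source
Through the Web UI
- Log in to Grafana
- Left menu → Configuration → Data Sources → Add data source (in Grafana 10, Connections → Data sources)
- Select Prometheus
- Enter the URL: http://prometheus:9090
- Click Save & Test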
Through provisioning (recommended)
# grafana/provisioning/datasources/prometheus.yml
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    # Explicit uid so that provisioned alert rules can reference this data source
    uid: prometheus
    access: proxy
    url: http://prometheus:9090
    isDefault: true
    editable: true
    jsonData:
      httpMethod: POST
      timeInterval: '15s'
      queryTimeout: '60s'
  - name: Prometheus-Thanos
    type: prometheus
    access: proxy
    url: http://thanos-query:9090
    editable: true
    jsonData:
      httpMethod: POST
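Data source configuration parameters
| Parameter | Description |
|---|---|
| url | Address of the Prometheus server |
| access | proxy (requests go through the Grafana backend) or direct (the browser queries Prometheus itself; deprecated in recent Grafana versions) |
| httpMethod | GET or POST (POST is recommended to avoid overly long URLs) |
| timeInterval | Should match scrape_interval; used as the lower bound when computing $__interval |
| queryTimeout | Query timeout |
| maxLines | Maximum number of lines for log data sources such as Loki; not used by Prometheus |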
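To make the role of `timeInterval` more concrete, the Python sketch below approximates how `$__interval` is derived: the dashboard time range divided by the panel's maximum number of data points, bounded below by the data source's minimum interval. It is a simplification under stated assumptions. Grafana additionally rounds the result to a set of predefined step sizes, which is omitted here, and the function name and parameters are hypothetical.

```python
def approx_interval_seconds(range_seconds: float,
                            max_data_points: int,
                            min_interval_seconds: float) -> float:
    """Approximate $__interval in seconds.

    One data point per (range / max_data_points), but never finer than
    the data source's minimum interval (the timeInterval setting).
    Rounding to Grafana's predefined step sizes is intentionally left out.
    """
    if max_data_points <= 0:
        raise ValueError("max_data_points must be positive")
    raw = range_seconds / max_data_points
    return max(raw, min_interval_seconds)


if __name__ == "__main__":
    # 1-hour range, 1000 points: the raw 3.6 s step is raised to the 15 s floor
    print(approx_interval_seconds(3600, 1000, 15))
    # 7-day range: the raw step already exceeds the 15 s floor
    print(approx_interval_seconds(7 * 24 * 3600, 1000, 15))
```

Setting `timeInterval` below the real scrape interval tends to produce steps that contain no samples, which is why the table recommends matching it to `scrape_interval`.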
16.4 Creating Dashboards
Creating a panel manually
- Left menu → Dashboards → New Dashboard
- Add a panel → Add visualization
- Select the data source
- Enter a PromQL query
- Configure the visualization options
- Save
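Automatic import through provisioning
# grafana/provisioning/dashboards/dashboards.yml
apiVersion: 1
providers:
  - name: 'Custom'
    orgId: 1
    folder: 'Custom'
    type: file
    disableDeletion: false
    editable: true
    allowUiUpdates: true
    options:
      path: /etc/grafana/provisioning/dashboards/json
      # Deriving folders from the directory layout requires `folder` to be unset,
      # so it stays disabled while `folder` is set above
      foldersFromFilesStructure: false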
File: grafana/provisioning/dashboards/json/node-exporter.json (a provisioned file holds the dashboard model itself, without the "dashboard" wrapper used by the HTTP API)
{
  "id": null,
  "uid": "node-exporter",
  "title": "Node Exporter",
  "tags": ["node", "linux"],
  "timezone": "browser",
  "refresh": "30s",
  "panels": [
    {
      "id": 1,
      "title": "CPU Usage",
      "type": "timeseries",
      "datasource": "Prometheus",
      "targets": [
        {
          "expr": "100 - (avg by(instance) (rate(node_cpu_seconds_total{mode=\"idle\"}[5m])) * 100)",
          "legendFormat": "{{ instance }}"
        }
      ],
      "fieldConfig": {
        "defaults": {
          "unit": "percent",
          "min": 0,
          "max": 100,
          "thresholds": {
            "mode": "absolute",
            "steps": [
              { "value": null, "color": "green" },
              { "value": 80, "color": "yellow" },
              { "value": 90, "color": "red" }
            ]
          }
        }
      }
    }
  ],
  "time": {
    "from": "now-1h",
    "to": "now"
  }
}
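The threshold `steps` in the panel JSON above can be read as "use the color of the highest step whose value the reading has reached". A minimal Python sketch of that lookup follows, assuming the steps are sorted in ascending order and that a `null` value marks the base step. The function name is hypothetical and this is an illustration of the rule, not Grafana's own code.

```python
# Same steps as in the panel JSON; None stands for JSON null,
# i.e. the base step that applies when no other step is reached.
STEPS = [
    {"value": None, "color": "green"},
    {"value": 80, "color": "yellow"},
    {"value": 90, "color": "red"},
]


def color_for(value: float, steps=STEPS) -> str:
    """Return the color of the highest step whose value is <= the reading.

    Assumes `steps` is sorted ascending with the base step first.
    """
    chosen = steps[0]["color"]
    for step in steps:
        if step["value"] is None or value >= step["value"]:
            chosen = step["color"]
    return chosen


if __name__ == "__main__":
    for reading in (50, 80, 95):
        print(reading, color_for(reading))
```

Under this reading a value exactly on a step boundary (80 here) already takes the color of that step.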
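16.5 Commonly Used Dashboards
Recommended dashboards
| ID | Name | Purpose |
|---|---|---|
| 1860 | Node Exporter Full | Comprehensive system metrics |
| 7362 | MySQL Overview | MySQL database monitoring |
| 11835 | Redis Dashboard | Redis cache monitoring |
| 9628 | Docker Monitoring | Docker container monitoring |
| 6417 | Kubernetes Cluster | Kubernetes cluster monitoring |
| 15661 | Prometheus 2.0 Stats | Monitoring Prometheus itself |
Check each ID on grafana.com before importing, since the same name can exist under several IDs.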
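Importing a dashboard
# Import through the Grafana API.
# The import endpoint expects the full dashboard JSON, so download it from grafana.com first
# (requires jq to build the request body).
curl -s https://grafana.com/api/dashboards/1860/revisions/latest/download -o node-exporter-full.json
jq -n --slurpfile d node-exporter-full.json '{
    dashboard: $d[0],
    overwrite: true,
    inputs: [{name: "DS_PROMETHEUS", type: "datasource", pluginId: "prometheus", value: "Prometheus"}]
  }' | curl -X POST http://admin:admin@localhost:3000/api/dashboards/import \
  -H "Content-Type: application/json" \
  -d @-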
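The request body for the import call can also be assembled in a script. The Python sketch below builds it from a dashboard model that has already been downloaded, under the assumption that the endpoint takes the full dashboard JSON in the `dashboard` field. The function name and the stand-in dashboard content are hypothetical; the `DS_PROMETHEUS` input name is the one used in the request above and may differ per dashboard.

```python
import json


def build_import_payload(dashboard: dict, datasource_name: str,
                         input_name: str = "DS_PROMETHEUS") -> dict:
    """Wrap a dashboard model in the body shape used by the import request."""
    return {
        "dashboard": dashboard,
        "overwrite": True,
        "inputs": [{
            "name": input_name,            # placeholder name declared by the dashboard
            "type": "datasource",
            "pluginId": "prometheus",
            "value": datasource_name,      # name of the data source in this Grafana
        }],
    }


if __name__ == "__main__":
    # Stand-in for a dashboard file downloaded from grafana.com (hypothetical content)
    dashboard = {"title": "Node Exporter Full", "panels": []}
    print(json.dumps(build_import_payload(dashboard, "Prometheus"), indent=2))
```

The printed JSON can be saved to a file and sent with `curl -d @file`, which keeps credentials and payload construction separate.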
Importing through the Web UI
- Left menu → Dashboards → Import
- Enter the dashboard ID (for example 1860)
- Select the Prometheus data source
- Click Import
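16.6 Variables
Variables make a dashboard dynamic and interactive: a single dashboard can switch between jobs, instances, or environments from a drop-down.
Creating a variable
- Dashboard settings → Variables → Add variable
- Configure the variable
Variable types
| Type | Description | Example |
|---|---|---|
| Query | Values queried from Prometheus | label_values(up, job) |
| Custom | A fixed list of values | production,staging,dev |
| Interval | A list of time intervals | 1m,5m,15m,1h |
| Data source | Selects a data source | prometheus |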
Common variable queries
# All job names
label_values(up, job)
# Instances of a specific job
label_values(up{job="$job"}, instance)
# All cluster names
label_values(up, cluster)
# All status codes
label_values(http_requests_total, status)
# All namespaces
label_values(kube_pod_info, namespace)
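`label_values(metric, label)` is a Grafana query helper rather than PromQL. As a rough sketch of what it resolves to, the Prometheus HTTP API exposes `/api/v1/label/<label_name>/values` with an optional `match[]` series selector. The Python below only builds that URL. Whether Grafana calls this endpoint or `/api/v1/series` depends on the Grafana and Prometheus versions in use, and the function name is hypothetical.

```python
from typing import Optional
from urllib.parse import quote, urlencode


def label_values_url(base_url: str, label: str,
                     selector: Optional[str] = None) -> str:
    """Build a Prometheus label-values URL, optionally filtered by a series selector."""
    url = f"{base_url.rstrip('/')}/api/v1/label/{quote(label, safe='')}/values"
    if selector:
        # match[] restricts the result to series matching the selector
        url += "?" + urlencode({"match[]": selector})
    return url


if __name__ == "__main__":
    # Roughly what label_values(up{job="node"}, instance) asks for
    print(label_values_url("http://prometheus:9090", "instance", 'up{job="node"}'))
```

Opening the printed URL against a running Prometheus returns a JSON list of label values, which is a quick way to debug a variable that shows no options.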
Referencing variables
# Use a variable in a query
rate(http_requests_total{job="$job"}[5m])
# Multi-value variable (requires the regex matcher =~)
rate(http_requests_total{job=~"$job"}[5m])
# Use the $interval variable
rate(http_requests_total[$interval])
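The reason a multi-value variable needs `=~` is that the selected values are joined into a regex alternation before the query is sent. The Python sketch below imitates that substitution. It is an approximation: Grafana's actual escaping rules differ in detail, the plain string replacement would also touch a longer variable name that starts with the same characters, and both function names are hypothetical.

```python
import re
from typing import List


def expand_multi_value(values: List[str]) -> str:
    """Join selected values into a regex alternation for use with =~."""
    escaped = [re.escape(v) for v in values]
    if len(escaped) == 1:
        return escaped[0]
    return "(" + "|".join(escaped) + ")"


def interpolate(query: str, name: str, values: List[str]) -> str:
    """Replace $name in the query with the expanded selection."""
    return query.replace(f"${name}", expand_multi_value(values))


if __name__ == "__main__":
    q = 'rate(http_requests_total{job=~"$job"}[5m])'
    print(interpolate(q, "job", ["api", "web"]))
```

With the equality matcher `=` the expanded text would be compared literally and match no series, which is the usual cause of empty panels after enabling multi-select.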
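16.7 Grafana Alerting
Grafana 8 introduced unified alerting, which lets you manage alert rules, contact points, and notifications directly in Grafana.
Configuring an alert rule
- Left menu → Alerting → Alert rules
- New alert rule
- Configure the query and condition
- Set the evaluation interval and notifications
Alert rule example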
# Alert rules configured through provisioning
# grafana/provisioning/alerting/rules.yml
apiVersion: 1
groups:
  - orgId: 1
    name: Infrastructure
    folder: Alerts
    interval: 1m
    rules:
      - uid: cpu-high
        title: High CPU Usage
        condition: C
        data:
          - refId: A
            # Must match the uid of the Prometheus data source
            datasourceUid: prometheus
            # Query window in seconds relative to now (last 10 minutes)
            relativeTimeRange:
              from: 600
              to: 0
            model:
              expr: 100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
              intervalMs: 60000
          - refId: B
            datasourceUid: __expr__
            model:
              type: reduce
              expression: A
              reducer: last
          - refId: C
            datasourceUid: __expr__
            model:
              type: threshold
              expression: B
              conditions:
                - evaluator:
                    type: gt
                    params: [80]
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "CPU usage is above 80%"
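The rule above chains three steps: query A returns a series per instance, expression B reduces each series to its last value, and expression C compares that value with the threshold. The Python sketch below mirrors that chain for a single series to show the logic. It is an illustration only: the function names are hypothetical, and the `for: 5m` pending period (the condition must keep holding before the alert fires) is not modeled.

```python
from typing import Sequence


def reduce_last(samples: Sequence[float]) -> float:
    """Step B: reduce a series to its most recent sample."""
    if not samples:
        raise ValueError("no samples to reduce")
    return samples[-1]


def threshold_gt(value: float, limit: float) -> bool:
    """Step C: the condition holds when the reduced value exceeds the limit."""
    return value > limit


def is_breaching(samples: Sequence[float], limit: float = 80.0) -> bool:
    """Chain A -> B -> C for the samples of one instance."""
    return threshold_gt(reduce_last(samples), limit)


if __name__ == "__main__":
    print(is_breaching([70.0, 85.0]))  # last sample is above 80
    print(is_breaching([90.0, 75.0]))  # last sample is back below 80
```

Because the reducer is `last`, an earlier spike inside the query window does not keep the condition true once the most recent sample has dropped below the threshold.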
# grafana/provisioning/alerting/contact_points.yml
apiVersion: 1
contactPoints:
  - orgId: 1
    name: slack-alerts
    receivers:
      - uid: slack-1
        type: slack
        settings:
          url: https://hooks.slack.com/services/xxx/yyy/zzz
          recipient: "#alerts"
          title: '{{ template "default.title" . }}'
          text: '{{ template "default.message" . }}'
  - orgId: 1
    name: email-alerts
    receivers:
      - uid: email-1
        type: email
        settings:
          addresses: team@example.com
16.8 Dashboard Best Practices
Panel organization
Dashboard: Node Overview
├── Row: Overview
│   ├── CPU Usage (timeseries)
│   ├── Memory Usage (gauge)
│   ├── Disk Usage (gauge)
│   └── Network I/O (timeseries)
├── Row: CPU Details
│   ├── CPU Usage by Mode (timeseries)
│   └── Load Average (timeseries)
├── Row: Memory Details
│   ├── Memory Usage Breakdown (timeseries)
│   └── Swap Usage (timeseries)
└── Row: Disk Details
    ├── Disk I/O (timeseries)
    └── Disk Space by Mountpoint (timeseries)
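Unit conventions
| Metric type | Grafana unit |
|---|---|
| CPU utilization | percent (0-100) |
| Memory | bytes (IEC, base 1024) / decbytes (SI, base 1000) |
| Network traffic | Bps (bytes/s) / bps (bits/s) |
| Request rate | reqps |
| Latency | s / ms |
| Disk IOPS | iops |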
Threshold settings
CPU utilization:
  below 70% → green
  70-85% → yellow
  above 85% → red
Disk utilization:
  below 70% → green
  70-85% → yellow
  above 85% → red
Error rate:
  below 1% → green
  1-5% → yellow
  above 5% → red
16.9 Chapter Summary
| Feature | How to configure |
|---|---|
| Data source | Provisioning YAML |
| Dashboard | JSON provisioning / Web UI |
| Variables | label_values() queries |
| Alerting | Unified alerting |
| Import | By ID / by JSON |