rqlite: The Complete Guide / Chapter 12: Monitoring and Observability
Chapter 12: Monitoring and Observability
Comprehensive monitoring for rqlite clusters using the status API, Prometheus, and Grafana.
12.1 Monitoring Architecture Overview
┌────────────────────────────────────────────────────┐
│                  Monitoring stack                  │
│                                                    │
│  ┌────────────┐   ┌─────────────┐   ┌───────────┐  │
│  │   rqlite   │   │ Prometheus  │   │  Grafana  │  │
│  │ status API │──►│  scraping   │──►│ dashboards│  │
│  └────────────┘   └──────┬──────┘   └───────────┘  │
│                          │                         │
│                  ┌───────▼──────┐                  │
│                  │ AlertManager │                  │
│                  │   alerting   │                  │
│                  └──────────────┘                  │
└────────────────────────────────────────────────────┘
| Layer | Component | Responsibility |
|---|---|---|
| Data source | rqlite status API | Exposes node and cluster metrics |
| Collection | Prometheus | Periodically scrapes metrics |
| Storage | Prometheus TSDB | Time-series storage |
| Presentation | Grafana | Dashboard visualization |
| Alerting | AlertManager | Threshold-based alert notifications |
12.2 The rqlite Status API
12.2.1 Node Status Endpoint
# Fetch the full status report (?pretty already pretty-prints the JSON)
curl -s 'localhost:4001/status?pretty'
Key metric fields:
{
  "build": {
    "branch": "master",
    "commit": "abc1234",
    "version": "v8.36.5"
  },
  "store": {
    "raft_state": "Leader",
    "node_id": "node1",
    "db_conf": {
      "fk_constraints": true,
      "wal": true,
      "on_disk": false
    },
    "num_raft_peers": 2,
    "num_open_connections": 5,
    "applied_index": 12345,
    "commit_index": 12345,
    "last_log_index": 12345,
    "last_log_term": 3,
    "last_snapshot_index": 12000,
    "last_snapshot_term": 2,
    "last_contact": "0s",
    "term": 3,
    "num_snaps": 5,
    "db_size": 1048576
  },
  "runtime": {
    "GOARCH": "amd64",
    "GOOS": "linux",
    "GOMAXPROCS": 4,
    "num_cpu": 8,
    "num_goroutine": 42,
    "version": "go1.21.5"
  }
}
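In practice you rarely read the whole document; a handful of fields answer most operational questions. A minimal sketch that derives the health signals used later in this chapter from a /status payload (the sample mirrors the fields shown above; `apply_lag` as commit minus applied index is an illustrative signal, not an official rqlite metric):

```python
# Derive key health signals from a parsed /status payload.
def health_signals(status: dict) -> dict:
    store = status["store"]
    return {
        "raft_state": store["raft_state"],
        # Entries committed but not yet applied locally
        "apply_lag": store["commit_index"] - store["applied_index"],
        "db_size_mb": store["db_size"] / 1024 / 1024,
        "peers": store["num_raft_peers"],
    }

sample = {"store": {"raft_state": "Leader", "applied_index": 12345,
                    "commit_index": 12345, "num_raft_peers": 2,
                    "db_size": 1048576}}
print(health_signals(sample))
# {'raft_state': 'Leader', 'apply_lag': 0, 'db_size_mb': 1.0, 'peers': 2}
```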
12.2.2 Health Check Endpoints
# Readiness check, including leader reachability (HTTP 200 = ready)
curl -s -o /dev/null -w "%{http_code}" localhost:4001/readyz
# Node-level readiness only, skipping the leader check
curl -s -o /dev/null -w "%{http_code}" 'localhost:4001/readyz?noleader'
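The readiness endpoint plugs directly into container orchestration. A sketch of a Docker Compose healthcheck against rqlite's `/readyz` endpoint (service name and timings are assumptions; the check relies on busybox `wget` being present in the image):

```yaml
services:
  rqlite1:
    image: rqlite/rqlite:latest
    healthcheck:
      # HTTP 200 from /readyz means the node is up and can reach the leader
      test: ["CMD", "wget", "-q", "--spider", "http://localhost:4001/readyz"]
      interval: 10s
      timeout: 3s
      retries: 5
```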
12.2.3 Node List Endpoint
curl -s 'localhost:4001/nodes?pretty'
| Field | Description |
|---|---|
| id | Node ID |
| api_addr | HTTP API address |
| addr | Raft address |
| voter | Whether the node is a voting member |
| reachable | Whether the node is reachable |
| leader | Whether the node is the Leader |
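A small watchdog can flag unreachable members from this endpoint. A sketch in Python — the response shape used here, a JSON object keyed by node ID, is an assumption for illustration; verify the exact format against your rqlite version:

```python
import json

def unreachable_nodes(nodes_json: str) -> list[str]:
    """Return the IDs of nodes reporting reachable=false."""
    nodes = json.loads(nodes_json)
    return [node_id for node_id, info in nodes.items()
            if not info.get("reachable", False)]

sample = '''{
  "node1": {"api_addr": "http://localhost:4001", "reachable": true,  "leader": true},
  "node2": {"api_addr": "http://localhost:4011", "reachable": false, "leader": false}
}'''
print(unreachable_nodes(sample))  # ['node2']
```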
12.3 Status Collection Scripts
12.3.1 Shell-Based Collection
#!/bin/bash
# rqlite-exporter.sh — collects rqlite metrics and prints them in Prometheus exposition format
NODES=("localhost:4001" "localhost:4011" "localhost:4021")
echo "# HELP rqlite_raft_state Current Raft state (0=unknown, 1=follower, 2=candidate, 3=leader)"
echo "# TYPE rqlite_raft_state gauge"
echo "# HELP rqlite_applied_index Last applied Raft log index"
echo "# TYPE rqlite_applied_index gauge"
echo "# HELP rqlite_commit_index Last committed Raft log index"
echo "# TYPE rqlite_commit_index gauge"
echo "# HELP rqlite_num_peers Number of Raft peers"
echo "# TYPE rqlite_num_peers gauge"
echo "# HELP rqlite_db_size Database size in bytes"
echo "# TYPE rqlite_db_size gauge"
echo "# HELP rqlite_num_snapshots Number of snapshots"
echo "# TYPE rqlite_num_snapshots gauge"
echo "# HELP rqlite_num_open_connections Number of open connections"
echo "# TYPE rqlite_num_open_connections gauge"
echo "# HELP rqlite_up Whether the node is up (1=up, 0=down)"
echo "# TYPE rqlite_up gauge"
for node in "${NODES[@]}"; do
  status=$(curl -s --connect-timeout 3 "http://$node/status" 2>/dev/null)
  if [ $? -ne 0 ] || [ -z "$status" ]; then
    echo "rqlite_up{node=\"$node\"} 0"
    continue
  fi
  echo "rqlite_up{node=\"$node\"} 1"

  # Parse the Raft state and map it to a numeric gauge value
  raft_state=$(echo "$status" | python3 -c "import json,sys; d=json.load(sys.stdin); print(d['store']['raft_state'])" 2>/dev/null)
  case "$raft_state" in
    "Leader")    state_val=3 ;;
    "Follower")  state_val=1 ;;
    "Candidate") state_val=2 ;;
    *)           state_val=0 ;;
  esac
  echo "rqlite_raft_state{node=\"$node\",state=\"$raft_state\"} $state_val"

  # Emit the remaining store metrics in a single pass
  echo "$status" | python3 -c "
import json, sys
d = json.load(sys.stdin)
s = d.get('store', {})
node = '$node'
print(f'rqlite_applied_index{{node=\"{node}\"}} {s.get(\"applied_index\", 0)}')
print(f'rqlite_commit_index{{node=\"{node}\"}} {s.get(\"commit_index\", 0)}')
print(f'rqlite_num_peers{{node=\"{node}\"}} {s.get(\"num_raft_peers\", 0)}')
print(f'rqlite_db_size{{node=\"{node}\"}} {s.get(\"db_size\", 0)}')
print(f'rqlite_num_snapshots{{node=\"{node}\"}} {s.get(\"num_snaps\", 0)}')
" 2>/dev/null
done
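One low-friction way to get this script's output into Prometheus is node_exporter's textfile collector: write the metrics to a `.prom` file atomically and let node_exporter expose them. A sketch of the scheduling side (all paths are assumptions; node_exporter must be started with `--collector.textfile.directory` pointing at the same directory):

```shell
# crontab entry: collect every minute, write atomically so the
# collector never reads a half-written file
* * * * * /opt/rqlite/rqlite-exporter.sh > /var/lib/node_exporter/rqlite.prom.$$ \
    && mv /var/lib/node_exporter/rqlite.prom.$$ /var/lib/node_exporter/rqlite.prom
```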
12.4 Prometheus Integration
12.4.1 Configuring Prometheus Scraping
# prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s

# Load the alerting rules mounted alongside this file (see 12.4.2)
rule_files:
  - rqlite-alerts.yml

alerting:
  alertmanagers:
    - static_configs:
        - targets: ['localhost:9093']

scrape_configs:
  # rqlite's status API is not in Prometheus format, so a custom
  # exporter is required; targets should point at wherever the
  # exporter serves /metrics
  - job_name: 'rqlite'
    static_configs:
      - targets:
          - 'localhost:4001'
          - 'localhost:4011'
          - 'localhost:4021'
        labels:
          cluster: 'rqlite-prod'
    metrics_path: /metrics
    scrape_interval: 30s
12.4.2 Prometheus Alerting Rules
# rqlite-alerts.yml
groups:
  - name: rqlite_alerts
    rules:
      # Node down
      - alert: RqliteNodeDown
        expr: rqlite_up == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "rqlite node {{ $labels.node }} is down"
          description: "Node {{ $labels.node }} has been unreachable for more than 1 minute"

      # No Leader — absent() is required here: count() over an empty
      # vector returns no samples, so `count(...) == 0` would never fire
      - alert: RqliteNoLeader
        expr: absent(rqlite_raft_state{state="Leader"})
        for: 30s
        labels:
          severity: critical
        annotations:
          summary: "rqlite cluster has no Leader"
          description: "No node in the cluster currently reports the Leader state"

      # Multiple Leaders (split brain)
      - alert: RqliteMultipleLeaders
        expr: count(rqlite_raft_state{state="Leader"}) > 1
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "rqlite cluster has multiple Leaders"
          description: "Detected {{ $value }} Leaders; the cluster may be split-brained"

      # Raft log replication lag
      - alert: RqliteReplicationLag
        expr: (max(rqlite_applied_index) - rqlite_applied_index) > 1000
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "rqlite node {{ $labels.node }} is lagging on replication"
          description: "Node is {{ $value }} log entries behind"

      # Database growing too fast
      - alert: RqliteDBSizeGrowing
        expr: increase(rqlite_db_size[1h]) > 104857600  # 100 MB/h
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "rqlite database is growing too fast"
          description: "Database grew by {{ $value }} bytes over the past hour"
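Alert expressions deserve the same testing as code. A sketch of a promtool unit test for the RqliteNodeDown rule (run with `promtool test rules rqlite-alerts-test.yml`; the filename is an assumption):

```yaml
# rqlite-alerts-test.yml
rule_files:
  - rqlite-alerts.yml

evaluation_interval: 15s

tests:
  - interval: 15s
    input_series:
      # The node reports down for the whole window
      - series: 'rqlite_up{node="localhost:4001"}'
        values: '0x10'
    alert_rule_test:
      - eval_time: 2m
        alertname: RqliteNodeDown
        exp_alerts:
          - exp_labels:
              severity: critical
              node: localhost:4001
```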
12.4.3 A Docker Compose Monitoring Stack
# docker-compose-monitoring.yml
version: "3.8"

services:
  prometheus:
    image: prom/prometheus:latest
    container_name: prometheus
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
      - ./rqlite-alerts.yml:/etc/prometheus/rqlite-alerts.yml
      - prometheus-data:/prometheus
    networks:
      - monitoring

  grafana:
    image: grafana/grafana:latest
    container_name: grafana
    ports:
      - "3000:3000"
    environment:
      GF_SECURITY_ADMIN_PASSWORD: admin
    volumes:
      - grafana-data:/var/lib/grafana
    networks:
      - monitoring
    depends_on:
      - prometheus

  alertmanager:
    image: prom/alertmanager:latest
    container_name: alertmanager
    ports:
      - "9093:9093"
    volumes:
      - ./alertmanager.yml:/etc/alertmanager/alertmanager.yml
    networks:
      - monitoring

volumes:
  prometheus-data:
  grafana-data:

networks:
  monitoring:
    driver: bridge
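Grafana can pick up the Prometheus datasource automatically at startup. A sketch of a provisioning file (mount it into the grafana service under `/etc/grafana/provisioning/datasources/`; the filename is arbitrary):

```yaml
# datasource.yml — Grafana datasource provisioning
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    # Service name from the Compose network above
    url: http://prometheus:9090
    isDefault: true
```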
12.5 Grafana Dashboards
12.5.1 Dashboard JSON
{
  "dashboard": {
    "title": "rqlite Cluster Monitoring",
    "panels": [
      {
        "title": "Node Status",
        "type": "stat",
        "targets": [{
          "expr": "rqlite_up",
          "legendFormat": "{{ node }}"
        }],
        "fieldConfig": {
          "defaults": {
            "mappings": [
              {"type": "value", "options": {"0": {"text": "Offline", "color": "red"}}},
              {"type": "value", "options": {"1": {"text": "Online", "color": "green"}}}
            ]
          }
        }
      },
      {
        "title": "Raft Role",
        "type": "stat",
        "targets": [{
          "expr": "rqlite_raft_state",
          "legendFormat": "{{ node }} - {{ state }}"
        }]
      },
      {
        "title": "Applied Index Trend",
        "type": "timeseries",
        "targets": [{
          "expr": "rate(rqlite_applied_index[5m])",
          "legendFormat": "{{ node }}"
        }]
      },
      {
        "title": "Database Size",
        "type": "timeseries",
        "targets": [{
          "expr": "rqlite_db_size",
          "legendFormat": "{{ node }}"
        }],
        "fieldConfig": {
          "defaults": {
            "unit": "bytes"
          }
        }
      }
    ]
  }
}
12.5.2 Key Monitoring Panels
| Panel | PromQL query | Notes |
|---|---|---|
| Node online status | rqlite_up | 1 = online, 0 = offline |
| Leader distribution | rqlite_raft_state{state="Leader"} | There should be exactly one |
| Replication lag | max(rqlite_applied_index) - rqlite_applied_index | Log entries a follower is behind |
| Write rate | rate(rqlite_applied_index[5m]) | Log entries applied per second |
| Database size | rqlite_db_size | Current database file size |
| Snapshot count | rqlite_num_snapshots | Cumulative snapshots taken |
| Connections | rqlite_num_open_connections | Current open HTTP connections |
12.6 Log Management
12.6.1 rqlite Log Levels
rqlite uses the Go standard log package and writes its logs to stderr:
# View logs in Docker
docker logs rqlite1 --tail 100 -f
# View logs under systemd
journalctl -u rqlited -f
# Note: rqlite does not currently support changing the log level at runtime
12.6.2 Key Log Patterns
| Log pattern | Meaning | Action |
|---|---|---|
| node is ready | Node is ready | Normal |
| RAFT: entering Leader state | Became Leader | Normal |
| RAFT: entering Follower state | Became Follower | Normal |
| RAFT: no known peers, starting as leader | First startup | Normal |
| RAFT: election timeout reached | Election timed out | ⚠️ Watch |
| failed to connect to | Connection failure | ❌ Investigate |
| snapshot started | Snapshot started | Normal |
| snapshot complete | Snapshot finished | Normal |
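The patterns above are easy to turn into a triage helper for log pipelines. A minimal sketch — the pattern-to-level mapping simply mirrors the table; extend it for your environment:

```python
# Map the log patterns from the table above to a triage level.
PATTERNS = [
    ("failed to connect to", "investigate"),
    ("election timeout reached", "watch"),
    ("node is ready", "normal"),
    ("entering Leader state", "normal"),
    ("entering Follower state", "normal"),
    ("snapshot started", "normal"),
    ("snapshot complete", "normal"),
]

def triage(line: str) -> str:
    """Return the triage level for one rqlite log line."""
    for pattern, level in PATTERNS:
        if pattern in line:
            return level
    return "unknown"

print(triage("2024-01-01 [raft] failed to connect to node2"))  # investigate
print(triage("[store] node is ready"))                         # normal
```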
12.6.3 Centralized Log Collection
# Promtail configuration (ships logs to Loki)
scrape_configs:
  - job_name: rqlite
    docker_sd_configs:
      - host: unix:///var/run/docker.sock
        filters:
          - name: label
            values: ["app=rqlite"]
    pipeline_stages:
      - regex:
          expression: '\[(?P<component>\w+)\] (?P<level>\w+): (?P<message>.*)'
      - labels:
          component:
          level:
12.7 Operational Scenario: An Alerting Strategy
| Severity | Condition | Response time | Action |
|---|---|---|---|
| P0 Critical | Cluster has no Leader | 5 minutes | Investigate immediately; restart failed nodes |
| P0 Critical | All nodes down | 1 minute | Respond immediately; restore service |
| P1 Major | Single node down | 15 minutes | Investigate and recover the node |
| P1 Major | Split brain detected | 5 minutes | Stop writes; investigate the network |
| P2 Warning | Replication lag > 1000 | 1 hour | Check network and disk |
| P2 Warning | Abnormal database growth | 1 hour | Investigate the cause of the growth |
| P3 Notice | Node restarted | Same day | Record and observe |
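The severity tiers above translate naturally into AlertManager routing. A sketch of an `alertmanager.yml` that pages for critical alerts and batches warnings (receiver names and webhook URLs are placeholders):

```yaml
# alertmanager.yml — route alerts by severity
route:
  receiver: default
  group_by: [alertname, node]
  routes:
    # P0/P1 — page the on-call immediately, re-notify often
    - matchers:
        - severity = "critical"
      receiver: oncall-pager
      repeat_interval: 5m
    # P2 — notify the team channel, re-notify hourly
    - matchers:
        - severity = "warning"
      receiver: team-chat
      repeat_interval: 1h

receivers:
  - name: default
    webhook_configs:
      - url: http://example.internal/alerts
  - name: oncall-pager
    webhook_configs:
      - url: http://example.internal/pager
  - name: team-chat
    webhook_configs:
      - url: http://example.internal/chat
```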
12.8 Monitoring Checklist
| Check | Collection interval | Alert threshold |
|---|---|---|
| Node online status | 30s | 3 consecutive failures |
| Leader presence | 30s | 0 Leaders |
| Replication lag | 1min | > 1000 log entries |
| Database size | 5min | Abnormal growth rate |
| Disk usage | 1min | > 85% |
| HTTP connections | 1min | > 1000 |
| Goroutine count | 5min | > 1000 |
| Response latency | 30s | > 5s |
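The numeric thresholds in the checklist can be evaluated mechanically against a metrics snapshot. A sketch (the snapshot field names are assumptions for illustration; the limits come from the table above):

```python
# Evaluate a metrics snapshot against the checklist thresholds.
# Field names here are illustrative, not rqlite metric names.
THRESHOLDS = {
    "disk_usage_pct": 85,      # disk usage > 85%
    "open_connections": 1000,  # HTTP connections > 1000
    "num_goroutines": 1000,    # goroutines > 1000
    "response_latency_s": 5,   # response latency > 5s
}

def violations(snapshot: dict) -> list[str]:
    """Return the names of checklist thresholds the snapshot exceeds."""
    return [name for name, limit in THRESHOLDS.items()
            if snapshot.get(name, 0) > limit]

sample = {"disk_usage_pct": 91, "open_connections": 120,
          "num_goroutines": 300, "response_latency_s": 0.02}
print(violations(sample))  # ['disk_usage_pct']
```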
12.9 Chapter Summary
| Topic | Takeaway |
|---|---|
| Status API | /status, /nodes, /readyz |
| Prometheus | Metrics collected via a custom exporter |
| Grafana | Visualizes cluster state and trends |
| Alerting strategy | Tiered alerts across four levels, P0 through P3 |
| Log management | Centralize collection; watch for key log patterns |
Previous: Chapter 11: Containerized Deployment
Next: Chapter 13: Troubleshooting