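08 · Alerting Configuration
Chapter goals
- Understand the architecture of vmalert and how it works
- Write alerting rules
- Integrate vmalert with Alertmanager
- Use recording rules to speed up queries
- Test and debug alerts
8.1 Introduction to vmalert
vmalert is the alerting engine that ships with VictoriaMetrics. It plays the same role as the alerting and recording rule evaluator built into Prometheus: it periodically runs MetricsQL queries against a datasource, sends firing alerts to Alertmanager, and writes recording rule results back to storage.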
┌──────────────────────────────────────────────────────┐
│                       vmalert                        │
│                                                      │
│  ┌──────────────────┐   ┌──────────────────┐         │
│  │  Alerting Rules  │   │ Recording Rules  │         │
│  └────────┬─────────┘   └────────┬─────────┘         │
│           │                      │                   │
│           │ evaluate MetricsQL   │ write results     │
│           │ periodically         │ back to VM        │
└───────────┼──────────────────────┼───────────────────┘
            │                      │
            ▼                      ▼
     ┌──────────────┐     ┌─────────────────┐
     │ Alertmanager │     │ VictoriaMetrics │
     │  (routing)   │     │ (result store)  │
     └──────┬───────┘     └─────────────────┘
            │
      ┌─────┼─────┐
      ▼     ▼     ▼
    Email Slack  DingTalk
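vmalert vs. Prometheus built-in rule evaluation
| Feature | Prometheus (built-in) | vmalert |
|---|---|---|
| Deployment | Embedded in the Prometheus server | Standalone process |
| Query language | PromQL | MetricsQL |
| Multi-tenancy | ❌ | ✅ (cluster version) |
| External labels | Limited | Fully supported |
| Recording rules | ✅ | ✅ |
| Backfilling | Requires promtool | Built in (replay mode) |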
8.2 Installation and startup
8.2.1 Download and install
# vmalert ships inside the vmutils archive, together with vmagent, vmauth and other tools
VM_VERSION="v1.106.0"
curl -LO "https://github.com/VictoriaMetrics/VictoriaMetrics/releases/download/${VM_VERSION}/vmutils-linux-amd64-${VM_VERSION}.tar.gz"
tar xzf "vmutils-linux-amd64-${VM_VERSION}.tar.gz"
sudo mv vmalert-prod /usr/local/bin/vmalert
sudo chmod +x /usr/local/bin/vmalert
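8.2.2 Basic startup
# -remoteWrite.url tells vmalert where to persist recording rule results and alert state.
# The rule glob is quoted so that vmalert, not the shell, expands it.
vmalert \
  -rule="/etc/vmalert/rules/*.yml" \
  -datasource.url=http://localhost:8428 \
  -remoteWrite.url=http://localhost:8428 \
  -notifier.url=http://alertmanager:9093 \
  -external.label=env=prod \
  -external.label=region=cn-north \
  -evaluationInterval=15s \
  -httpListenAddr=:8880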
8.2.3 systemd service
# /etc/systemd/system/vmalert.service
[Unit]
Description=VictoriaMetrics Alert Engine
After=network.target victoria-metrics.service
[Service]
Type=simple
User=victoriametrics
Group=victoriametrics
ExecStart=/usr/local/bin/vmalert \
  -rule=/etc/vmalert/rules/*.yml \
  -datasource.url=http://localhost:8428 \
  -remoteWrite.url=http://localhost:8428 \
  -notifier.url=http://localhost:9093 \
  -external.label=env=prod \
  -evaluationInterval=15s \
  -httpListenAddr=:8880
Restart=on-failure
RestartSec=10
[Install]
WantedBy=multi-user.target
8.3 Writing alerting rules
8.3.1 Rule file format
# /etc/vmalert/rules/infra.yml
groups:
  - name: infrastructure
    interval: 30s      # evaluation interval (optional, overrides the global -evaluationInterval)
    concurrency: 2     # number of rules evaluated in parallel (optional)
    rules:
      - alert: HighCPU
        expr: avg by (host) (cpu_usage) > 80
        for: 5m
        labels:
          severity: warning
          team: platform
        annotations:
          summary: "High CPU usage on host {{ $labels.host }}"
          description: "Current CPU usage: {{ $value | printf \"%.1f\" }}%"
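8.3.2 Rule fields
| Field | Required | Description |
|---|---|---|
| alert | Yes | Name of the alert |
| expr | Yes | MetricsQL expression to evaluate |
| for | No | How long the expression must keep returning results before the alert fires |
| labels | No | Extra labels attached to the alert |
| annotations | No | Descriptive text for the alert (templating supported) |
| keep_firing_for | No | How long the alert keeps firing after the expression stops returning results |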
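To make the interplay of `for` and `keep_firing_for` concrete, here is a minimal sketch that reuses the `cpu_usage` metric from 8.3.1. The group name, the alert name `HighCPUSticky`, and both durations are assumptions chosen for illustration only.

```yaml
groups:
  - name: field-demo              # hypothetical group name
    rules:
      - alert: HighCPUSticky      # hypothetical alert name
        expr: avg by (host) (cpu_usage) > 80
        for: 5m                   # expression must hold for 5 minutes before the alert fires
        keep_firing_for: 10m      # keep firing for 10 minutes after the expression stops matching
        labels:
          severity: warning
        annotations:
          summary: "High CPU usage on host {{ $labels.host }}"
```

With these values, a host that drops below 80% for a minute and then climbs back should not resolve and fire again, which tends to cut down on flapping notifications.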
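8.3.3 Template variables
The following template variables can be used in annotations:
| Variable | Description | Example |
|---|---|---|
| {{ $value }} | Value returned by the query | {{ $value }} |
| {{ $labels.xxx }} | Value of label xxx | {{ $labels.host }} |
| {{ $externalLabels.xxx }} | Value of external label xxx | {{ $externalLabels.env }} |
| {{ $labels.alertname }} | Name of the alert (exposed as the alertname label) | {{ $labels.alertname }} |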
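The variables can be combined in a single annotation. The sketch below assumes vmalert was started with `-external.label=env=prod` as in 8.2.2, and that the `memory_usage` series carry a `host` label; the group name is hypothetical.

```yaml
groups:
  - name: template-demo           # hypothetical group name
    rules:
      - alert: HighMemoryUsage
        expr: memory_usage > 90
        for: 10m
        labels:
          severity: warning
        annotations:
          # $externalLabels comes from the -external.label flags, $labels from the query result
          summary: "[{{ $externalLabels.env }}] memory above 90% on {{ $labels.host }}"
          description: "Current value: {{ $value | printf \"%.1f\" }}%"
```

Under those assumptions the summary would render along the lines of `[prod] memory above 90% on <host>`.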
8.4 Common alerting rules
8.4.1 Infrastructure alerts
groups:
  - name: infrastructure-alerts
    rules:
      # Instance liveness
      - alert: InstanceDown
        expr: up == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Instance {{ $labels.instance }} is unreachable"
      # CPU usage
      - alert: HighCPUUsage
        expr: avg by (host) (cpu_usage) > 80
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High CPU usage on {{ $labels.host }}"
          description: "Current value: {{ $value | printf \"%.1f\" }}%"
      # Memory usage
      - alert: HighMemoryUsage
        expr: memory_usage > 90
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Memory usage on {{ $labels.host }} is above 90%"
      # Disk space trend: linear extrapolation of the last 7 days, 7 days ahead
      - alert: DiskSpaceRunningOut
        expr: predict_linear(disk_usage[7d], 3600*24*7) > 95
        for: 1h
        labels:
          severity: warning
        annotations:
          summary: "Disk on {{ $labels.host }} is predicted to exceed 95% within 7 days"
      # Disk space, urgent
      - alert: DiskSpaceCritical
        expr: disk_usage > 95
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Disk usage on {{ $labels.host }} is above 95%"
8.4.2 Application alerts
groups:
  - name: application-alerts
    rules:
      # HTTP error rate
      - alert: HighErrorRate
        expr: |
          100 * sum by (job) (
            rate(http_requests_total{status=~"5.."}[5m])
          ) / sum by (job) (
            rate(http_requests_total[5m])
          ) > 5
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "{{ $labels.job }} HTTP 5xx error rate is above 5%"
          description: "Current error rate: {{ $value | printf \"%.2f\" }}%"
      # P99 request latency
      - alert: HighLatencyP99
        expr: |
          histogram_quantile(0.99,
            sum by (le, job) (rate(http_duration_bucket[5m]))
          ) > 1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "{{ $labels.job }} P99 latency is above 1 second"
          description: "Current P99: {{ $value | printf \"%.3f\" }}s"
      # Service unavailable: fires when no api-server target reports up == 1
      - alert: ServiceDown
        expr: absent(up{job="api-server"} == 1)
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "api-server is unavailable"
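8.4.3 Alerts on VictoriaMetrics itself
groups:
  - name: vm-alerts
    rules:
      # Ingestion rate dropped to less than half of the rate one hour ago
      - alert: VMInsertRateDropped
        expr: |
          rate(vm_rows_inserted_total[5m]) <
            rate(vm_rows_inserted_total[5m] offset 1h) * 0.5
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "VictoriaMetrics ingestion rate dropped sharply"
      # Surge in active time series
      # (the active series count is exposed as the size of the storage/hour_metric_ids cache)
      - alert: VMActiveTimeSeriesHigh
        expr: vm_cache_entries{type="storage/hour_metric_ids"} > 5000000
        for: 30m
        labels:
          severity: warning
        annotations:
          summary: "Active time series count is above 5 million"
      # Slow queries: vm_slow_queries_total is a counter, so alert on its recent growth
      - alert: VMSlowQueries
        expr: increase(vm_slow_queries_total[5m]) > 10
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "More than 10 slow queries in the last 5 minutes"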
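8.5 Recording Rules
8.5.1 What are recording rules
Recording rules evaluate an expensive query at every evaluation interval and store the result as a new time series. Dashboards and alerts can then read the stored series instead of re-running the original query.
Original query (complex and slow):
  histogram_quantile(0.99,
    sum by (le, job) (rate(http_duration_bucket[5m]))
  )
After precomputation by a recording rule:
  job:http_request_duration_seconds:p99   ← a plain series lookup, fast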
8.5.2 Configuration example
groups:
  - name: recording-rules
    interval: 30s
    rules:
      # Precompute P99 latency
      - record: job:http_request_duration_seconds:p99
        expr: |
          histogram_quantile(0.99,
            sum by (le, job) (rate(http_duration_bucket[5m]))
          )
      # Precompute request rate
      - record: job:http_requests:rate5m
        expr: sum by (job) (rate(http_requests_total[5m]))
      # Precompute error ratio (a value between 0 and 1)
      - record: job:http_errors:ratio5m
        expr: |
          sum by (job) (rate(http_requests_total{status=~"5.."}[5m]))
          / sum by (job) (rate(http_requests_total[5m]))
      # Precompute CPU usage per host
      - record: host:cpu_usage:avg
        expr: avg by (host) (cpu_usage)
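Once a recorded series exists, an alerting rule can read it directly instead of repeating the full expression. A minimal sketch, assuming the `job:http_errors:ratio5m` rule above is being evaluated and its results are persisted; the group name and alert name are hypothetical.

```yaml
groups:
  - name: recorded-series-alerts    # hypothetical group name
    rules:
      - alert: HighErrorRatio       # hypothetical alert name
        # job:http_errors:ratio5m is a ratio between 0 and 1, so 0.05 corresponds to 5%
        expr: job:http_errors:ratio5m > 0.05
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "{{ $labels.job }} 5xx error ratio is above 5%"
          description: "Current ratio: {{ $value | printf \"%.4f\" }}"
```

The trade-off is freshness: the alert sees the value from the recording rule's last evaluation, so it can lag the raw data by up to one recording interval.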
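8.5.3 Naming convention for recording rules
<level>:<metric>:<aggregation>
Example:
job:http_request_duration_seconds:p99
└┬┘ └─────────────┬─────────────┘ └┬┘
 │                │                └── aggregation type
 │                └── original metric name
 └── aggregation level
Levels:
- job: aggregated by job
- host: aggregated by host
- cluster: aggregated by cluster
- instance: aggregated by instance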
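As an illustration of the convention, the same request-rate metric can be recorded at several levels. This is a sketch only: the group name is hypothetical, and the cluster-level rule assumes the `http_requests_total` series carry a `cluster` label, which the examples in this chapter do not confirm.

```yaml
groups:
  - name: recording-levels          # hypothetical group name
    interval: 30s
    rules:
      # job level: one series per job
      - record: job:http_requests:rate5m
        expr: sum by (job) (rate(http_requests_total[5m]))
      # instance level: one series per instance
      - record: instance:http_requests:rate5m
        expr: sum by (instance) (rate(http_requests_total[5m]))
      # cluster level: assumes a `cluster` label exists on the source series
      - record: cluster:http_requests:rate5m
        expr: sum by (cluster) (rate(http_requests_total[5m]))
```

Reading the prefix then tells you which labels survive on the recorded series without opening the rule file.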
8.6 Alertmanager integration
8.6.1 Installing Alertmanager
# Run Alertmanager with Docker
docker run -d \
  --name alertmanager \
  -p 9093:9093 \
  -v /etc/alertmanager:/etc/alertmanager \
  prom/alertmanager:v0.27.0 \
  --config.file=/etc/alertmanager/alertmanager.yml
8.6.2 Alertmanager configuration
# /etc/alertmanager/alertmanager.yml
global:
  resolve_timeout: 5m
  smtp_smarthost: 'smtp.example.com:587'
  smtp_from: 'alertmanager@example.com'
  smtp_auth_username: 'alertmanager@example.com'
  smtp_auth_password: 'password'
route:
  # Default route
  receiver: 'default-receiver'
  group_by: ['alertname', 'env']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  routes:
    # VictoriaMetrics self-monitoring alerts.
    # Listed before the severity routes so that it is not shadowed by them.
    - matchers:
        - team="vm-ops"
      receiver: 'vm-ops-receiver'
    # critical alerts
    - matchers:
        - severity="critical"
      receiver: 'critical-receiver'
      repeat_interval: 1h
    # warning alerts
    - matchers:
        - severity="warning"
      receiver: 'warning-receiver'
      repeat_interval: 4h
receivers:
  - name: 'default-receiver'
    email_configs:
      - to: 'ops@example.com'
  - name: 'critical-receiver'
    email_configs:
      - to: 'ops-critical@example.com'
    webhook_configs:
      - url: 'http://dingtalk-webhook:8060/dingtalk/ops/send'
        send_resolved: true
  - name: 'warning-receiver'
    email_configs:
      - to: 'ops@example.com'
  - name: 'vm-ops-receiver'
    webhook_configs:
      - url: 'http://dingtalk-webhook:8060/dingtalk/vm-ops/send'
# Inhibition rules (optional): a firing critical alert mutes the warning alert
# that has the same alertname and instance
inhibit_rules:
  - source_matchers:
      - severity="critical"
    target_matchers:
      - severity="warning"
    equal: ['alertname', 'instance']
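Routing on `team: vm-ops` only works if some alert actually carries that label, and the VictoriaMetrics rules in 8.4.3 do not set it. The sketch below shows one of those rules with the label added; treating `team: vm-ops` as the routing key is an assumption taken from the route definition, not something the rule files state.

```yaml
groups:
  - name: vm-alerts
    rules:
      - alert: VMInsertRateDropped
        expr: |
          rate(vm_rows_inserted_total[5m]) <
            rate(vm_rows_inserted_total[5m] offset 1h) * 0.5
        for: 15m
        labels:
          severity: warning
          team: vm-ops      # assumption: the label the vm-ops route matches on
        annotations:
          summary: "VictoriaMetrics ingestion rate dropped sharply"
```

Because Alertmanager stops at the first matching child route unless `continue: true` is set, this alert reaches `vm-ops-receiver` only when the `team` route is evaluated before the severity routes.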
8.6.3 Multiple Alertmanager instances
# vmalert can send to several Alertmanagers (HA): repeat -notifier.url once per instance.
# Timeout-related flags differ between versions; check `vmalert -help` before adding one.
vmalert \
  -rule="/etc/vmalert/rules/*.yml" \
  -datasource.url=http://localhost:8428 \
  -notifier.url=http://alertmanager1:9093 \
  -notifier.url=http://alertmanager2:9093
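The Alertmanager targets can also live in a file that is passed with `-notifier.config`, used in place of the `-notifier.url` flags. A minimal sketch, assuming static targets; the file path is hypothetical, and the field names should be checked against the vmalert documentation for your version.

```yaml
# /etc/vmalert/notifiers.yml (hypothetical path)
static_configs:
  - targets:
      # host:port targets; both instances receive every alert
      - alertmanager1:9093
      - alertmanager2:9093
```

vmalert would then be started with `-notifier.config=/etc/vmalert/notifiers.yml`, which keeps the list of Alertmanagers out of the command line and the systemd unit.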
8.7 Testing and debugging alerts
8.7.1 Validating rule syntax
# -dryRun parses and validates the rule files, then exits without evaluating anything
vmalert \
  -rule="/etc/vmalert/rules/*.yml" \
  -datasource.url=http://localhost:8428 \
  -dryRun
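Syntax validation does not tell you whether a rule fires when it should. For that, the `vmalert-tool unittest` command runs rules against synthetic series. The sketch below assumes its test file layout, which mirrors promtool's with an extra `groupname` field, and assumes the `infrastructure-alerts` group from 8.4.1 is stored at the hypothetical path shown; verify the field names against the vmalert-tool documentation.

```yaml
# /etc/vmalert/tests/infra_test.yml (hypothetical path)
rule_files:
  - /etc/vmalert/rules/infra-alerts.yml   # hypothetical file holding the 8.4.1 group
evaluation_interval: 1m
tests:
  - interval: 1m
    input_series:
      # synthetic input: `up` stays at 0 for eleven one-minute samples
      - series: 'up{job="node", instance="host1:9100"}'
        values: "0+0x10"
    alert_rule_test:
      - eval_time: 5m                     # well past the rule's `for: 1m`
        groupname: infrastructure-alerts
        alertname: InstanceDown
        exp_alerts:
          - exp_labels:
              job: node
              instance: "host1:9100"
              severity: critical
            exp_annotations:
              # must equal the rendered annotation of your rule; adjust the text to match
              summary: "Instance host1:9100 is unreachable"
```

It would be run with something like `vmalert-tool unittest --files=/etc/vmalert/tests/infra_test.yml`. If vmalert runs with external labels, they may need to be listed under `exp_labels` as well.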
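8.7.2 Testing expressions in VMUI
# Paste the alert expression into VMUI.
# Every series it returns corresponds to an alert that would become pending, then fire once `for` elapses.
avg by (host) (cpu_usage) > 80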
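8.7.3 vmalert API
# List all rules
curl http://localhost:8880/api/v1/rules
# List active alerts
curl http://localhost:8880/api/v1/alerts
# Show a single rule group (-g stops curl from treating [] as a glob pattern)
curl -g 'http://localhost:8880/api/v1/rules?rule_group[]=infrastructure'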
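8.7.4 Replaying rules over historical data
# Replay mode evaluates the rules over a past time range to show whether they would have fired.
# No notifications are sent; results (recording rule series plus ALERTS and ALERTS_FOR_STATE)
# are written to -remoteWrite.url.
vmalert \
  -rule="/etc/vmalert/rules/*.yml" \
  -datasource.url=http://localhost:8428 \
  -remoteWrite.url=http://localhost:8428 \
  -replay.timeFrom=2024-01-01T00:00:00Z \
  -replay.timeTo=2024-01-31T23:59:59Z \
  -replay.maxDatapoints=1000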
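8.8 Alerting best practices
8.8.1 Severity levels
| Severity | Meaning | Response time | Notification channel |
|---|---|---|---|
| critical | Service outage or data loss | Within 5 minutes | Phone call + SMS + instant message |
| warning | Degraded performance or resource pressure | Within 30 minutes | Instant message + email |
| info | Worth attention but not urgent | Next business day | Email or ticket |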
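8.8.2 Avoiding alert fatigue
# ❌ Not recommended: a `for` that is too short makes the alert fire constantly
- alert: HighCPU
  expr: cpu_usage > 80
  for: 10s   # too short: any brief spike fires the alert
# ✅ Recommended: a reasonable duration
- alert: HighCPU
  expr: avg by (host) (cpu_usage) > 80
  for: 5m    # must hold for 5 minutes before firing
# ✅ Aggregate to reduce the number of alerts
- alert: HighCPU
  expr: avg by (host) (cpu_usage) > 80   # one alert per host
  # rather than
  # cpu_usage > 80                       # one alert per series, possibly hundreds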
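8.8.3 Graceful degradation
# Use absent() to detect monitoring data that has gone missing
- alert: MonitoringDown
  expr: absent(up{job="victoria-metrics"})
  for: 2m
  labels:
    severity: critical
  annotations:
    summary: "VictoriaMetrics metrics are missing; the monitoring system itself may be down"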
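Chapter summary
| Topic | Key points |
|---|---|
| vmalert | Standalone alerting engine; supports alerting rules and recording rules |
| Rule format | Compatible with the Prometheus rule file format; expressions are written in MetricsQL |
| Alertmanager | Many notification channels; HA deployment supported |
| Recording rules | Precompute complex queries to speed up reads |
| Best practices | Sensible `for` durations, aggregation, avoiding alert fatigue |
Further reading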