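12 · Self-Monitoring

Chapter goals

  • Understand the self-monitoring metrics that VictoriaMetrics exposes
  • Build monitoring dashboards with Grafana
  • Configure the built-in alerting rules
  • Learn how to run health checks and monitor capacity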

12.1 Self-monitoring metrics

12.1.1 Metrics endpoint

# Fetch all metrics
curl http://localhost:8428/metrics

# Preview the first 50 lines (the output is in Prometheus text format)
curl -s http://localhost:8428/metrics | head -50
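
The endpoint returns plain text with one sample per line. As a minimal sketch of how a script might read that output, the following parses a few lines into a dictionary. The sample text is made up for illustration, and the helper name `parse_metrics` is hypothetical; it ignores timestamps and assumes label values contain no spaces, which the exposition format does not guarantee.

```python
def parse_metrics(text):
    """Parse Prometheus text-format lines into {series: value}.

    Simplified: skips comment and blank lines, and assumes label values
    contain no whitespace (true for the sample below, not in general).
    """
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        # The series name (with optional {labels}) and the value are
        # separated by whitespace; an optional timestamp may follow.
        parts = line.split()
        if len(parts) < 2:
            continue
        samples[parts[0]] = float(parts[1])
    return samples


# Illustrative sample, not real output from a VictoriaMetrics instance.
SAMPLE = """\
# HELP process_open_fds Number of open file descriptors.
# TYPE process_open_fds gauge
process_open_fds 42
process_resident_memory_bytes 104857600
vm_rows_inserted_total{type="example"} 123456
"""

metrics = parse_metrics(SAMPLE)
print(metrics["process_open_fds"])
```

For anything beyond a quick check, an established client library for the exposition format is a safer choice than hand-rolled parsing.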

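12.1.2 Key metrics by category

Category  Metric                         Description
--------  -----------------------------  ------------------------------
Ingest    vm_rows_inserted_total         Total rows inserted
Ingest    vm_slow_inserts_total          Slow insert count
Ingest    vm_inserts_total               Insert request count
Query     vm_request_duration_seconds    Request latency distribution
Query     vm_slow_queries_total          Slow query count
Query     vm_concurrent_queries          Queries currently executing
Storage   vm_active_timeseries           Active time series
Storage   vm_timeseries_created_total    Total time series created
Storage   vm_parts                       Number of parts
Storage   vm_rows                        Total stored rows
Cache     vm_cache_entries               Cache entries
Cache     vm_cache_size_bytes            Cache size in bytes
System    process_resident_memory_bytes  Resident memory (RSS)
System    process_cpu_seconds_total      Cumulative CPU time in seconds
System    process_open_fds               Open file descriptors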

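12.1.3 Important metrics in detail

# Ingestion throughput (samples/s). The type label identifies the ingestion
# protocol, so sum across it to get the total.
sum(rate(vm_rows_inserted_total[5m]))

# Query latency P99. This form assumes the metric is exposed as histogram
# buckets; if your instance exposes it as a summary with a quantile label,
# read vm_request_duration_seconds{path="/api/v1/query", quantile="0.99"} instead.
histogram_quantile(0.99,
    sum(rate(vm_request_duration_seconds_bucket{path="/api/v1/query"}[5m])) by (le)
)

# Active time series
vm_active_timeseries

# Disk usage ratio (0 to 1)
1 - (vm_free_disk_space_bytes / vm_total_disk_space_bytes)

# Cache hit ratio. On raw counters this is the ratio since process start;
# wrap each counter in rate(...[5m]) to get the ratio for a recent window.
vm_cache_hits_total / (vm_cache_hits_total + vm_cache_misses_total)

# Slow queries per second
rate(vm_slow_queries_total[5m])

# Cumulative number of merges (a counter, not the number of parts being merged right now)
vm_merges_total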
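
Several of the expressions above wrap a counter in rate(). To make the arithmetic concrete, here is a simplified sketch of a per-second rate over a window of samples, including the usual treatment of counter resets. The helper name `simple_rate` and the sample values are invented for illustration; real query engines also deal with window boundaries and missing samples, which this sketch does not.

```python
def simple_rate(samples):
    """Per-second increase of a counter over a window.

    `samples` is a list of (unix_timestamp, value) pairs in time order.
    A drop in value is treated as a counter reset (for example a process
    restart): the post-reset value is counted as an increase from zero.
    """
    if len(samples) < 2:
        return None  # not enough data to compute a rate
    increase = 0.0
    for (_, prev), (_, cur) in zip(samples, samples[1:]):
        increase += (cur - prev) if cur >= prev else cur
    elapsed = samples[-1][0] - samples[0][0]
    return increase / elapsed if elapsed > 0 else None


# Made-up samples taken every 15 s, with a counter reset before the last one.
window = [(0, 1000), (15, 1600), (30, 2200), (45, 300)]
print(simple_rate(window))  # (600 + 600 + 300) / 45 seconds
```

Without the reset handling, the last pair would contribute a large negative increase and the computed rate would be meaningless, which is why counters should be queried through rate() or increase() rather than subtracted by hand.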

12.2 Grafana dashboards

12.2.1 Configuring the data source

# Grafana data source provisioning file
# /etc/grafana/provisioning/datasources/victoriametrics.yml
apiVersion: 1

datasources:
  - name: VictoriaMetrics
    type: prometheus
    url: http://localhost:8428
    access: proxy
    isDefault: true
    jsonData:
      httpMethod: POST
      timeInterval: "15s"
    editable: true

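12.2.2 Recommended dashboards

Several official Grafana dashboards are available:

Dashboard ID  Name                           Purpose
------------  -----------------------------  ----------------------
10229         VictoriaMetrics - single-node  Single-node monitoring
11176         VictoriaMetrics - cluster      Cluster monitoring
12683         VictoriaMetrics - vmagent      vmagent monitoring
14950         VictoriaMetrics - vmalert      vmalert monitoring

To import a dashboard:

  1. Grafana → Dashboards → Import
  2. Enter the dashboard ID
  3. Select the VictoriaMetrics data source
  4. Click Import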

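12.2.3 Core panels

Ingestion panel

# Ingestion rate (samples/s), summed across ingestion protocols
sum(rate(vm_rows_inserted_total[5m]))

# Slow insert rate (a count of slow inserts per second, not a latency)
rate(vm_slow_inserts_total[5m])

# Concurrent inserts
vm_concurrent_inserts

Query panel

# Query QPS
sum(rate(vm_http_requests_total{path=~"/api/v1/.*"}[5m]))

# Query latency distribution (P50 and P99)
histogram_quantile(0.50, sum(rate(vm_request_duration_seconds_bucket[5m])) by (le))
histogram_quantile(0.99, sum(rate(vm_request_duration_seconds_bucket[5m])) by (le))

# Slow queries in the last hour
increase(vm_slow_queries_total[1h])

Storage panel

# Active time series
vm_active_timeseries

# Free disk space in bytes
vm_free_disk_space_bytes

# Number of parts
sum(vm_parts)

System panel

# Memory usage (RSS)
process_resident_memory_bytes{job="victoria-metrics"}

# CPU usage as a percentage of one core (values above 100 mean more than one core is busy)
rate(process_cpu_seconds_total{job="victoria-metrics"}[5m]) * 100

# Number of goroutines
go_goroutines{job="victoria-metrics"}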
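
The latency queries in the query panel call histogram_quantile over `le` buckets. As a rough sketch of what that estimate involves, the following finds the bucket containing the requested rank and interpolates linearly inside it. The function name `quantile_from_buckets` and the bucket counts are invented for illustration; this is the common textbook approach, not a copy of any query engine's implementation.

```python
import math


def quantile_from_buckets(q, buckets):
    """Estimate the q-quantile from cumulative histogram buckets.

    `buckets` is a list of (upper_bound, cumulative_count) pairs sorted by
    upper bound, with math.inf as the last bound.
    """
    total = buckets[-1][1]
    if total == 0:
        return math.nan  # no observations
    rank = q * total
    lower_bound, lower_count = 0.0, 0.0
    for upper_bound, count in buckets:
        if count >= rank:
            if math.isinf(upper_bound):
                # The quantile lies beyond the last finite bound, so the
                # best available answer is that bound.
                return lower_bound
            in_bucket = count - lower_count
            if in_bucket == 0:
                return lower_bound
            # Assume observations are spread evenly inside the bucket.
            fraction = (rank - lower_count) / in_bucket
            return lower_bound + (upper_bound - lower_bound) * fraction
        lower_bound, lower_count = upper_bound, count
    return lower_bound


# Made-up request durations: 100 requests in total.
example = [(0.1, 50), (0.5, 90), (1.0, 99), (math.inf, 100)]
print(quantile_from_buckets(0.50, example))
print(quantile_from_buckets(0.99, example))
```

Because the result is interpolated, its precision is bounded by the bucket widths: a P99 that falls in a wide bucket can be off by a large fraction of that bucket's range.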

12.3 Built-in alerting rules

The VictoriaMetrics project publishes a set of recommended alerting rules:

12.3.1 Downloading the official rules

# Alerting rules for the single-node version
curl -LO https://raw.githubusercontent.com/VictoriaMetrics/VictoriaMetrics/master/deployment/docker/alerts.yml

# Alerting rules for the cluster version
curl -LO https://raw.githubusercontent.com/VictoriaMetrics/VictoriaMetrics/master/deployment/docker/alerts-cluster.yml

# Alerting rules for vmalert
curl -LO https://raw.githubusercontent.com/VictoriaMetrics/VictoriaMetrics/master/deployment/docker/alerts-vmalert.yml

12.3.2 Core alerting rules

# /etc/vmalert/rules/vm-health.yml
groups:
  - name: vm-health
    rules:
      # Frequent restarts
      - alert: TooManyRestarts
        expr: changes(process_start_time_seconds{job=~"victoria.*"}[15m]) > 2
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "{{ $labels.job }} is restarting frequently"
          description: "{{ $labels.instance }} restarted {{ $value }} times in the last 15 minutes"

      # Ingestion rate drop
      - alert: RowsInsertRateDrop
        expr: |
          (
            rate(vm_rows_inserted_total[5m]) < 0.5 *
            (rate(vm_rows_inserted_total[5m] offset 1h))
          ) and (
            rate(vm_rows_inserted_total[5m]) > 0
          )
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "Ingestion rate has dropped below 50% of its value one hour ago"

      # Too many active time series (static threshold)
      - alert: TooHighActiveTimeSeries
        expr: vm_active_timeseries > 5000000
        for: 30m
        labels:
          severity: warning
        annotations:
          summary: "Active time series exceed 5 million"
          description: "Current active series: {{ $value }}"

      # Disk predicted to fill up
      - alert: DiskRunsOutOfSpaceIn24h
        expr: |
          predict_linear(vm_free_disk_space_bytes[1h], 24*3600) < 0
        for: 30m
        labels:
          severity: critical
        annotations:
          summary: "{{ $labels.instance }} is predicted to run out of disk space within 24 hours"

      # Disk space critically low
      - alert: DiskRunsOutOfSpace
        expr: vm_free_disk_space_bytes < 10 * 1024 * 1024 * 1024
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "{{ $labels.instance }} has less than 10 GiB of free disk space"

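      # High memory usage
      # vm_available_memory_bytes is the total memory VictoriaMetrics detects as
      # available to its process, so both sides of the division carry the same
      # labels. Dividing by a node_exporter metric would not match on labels,
      # and a fallback of RSS * 1.5 gives a constant 0.67 that never exceeds 0.9.
      - alert: TooHighMemoryUsage
        expr: |
          process_resident_memory_bytes / vm_available_memory_bytes > 0.9
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "{{ $labels.instance }} memory usage is above 90%"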

      # Too many slow queries
      - alert: TooManySlowQueries
        expr: rate(vm_slow_queries_total[5m]) > 1
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "{{ $labels.instance }} slow query rate is above 1/s"

      # HTTP request error ratio
      - alert: TooManyErrors
        expr: |
          sum(rate(vm_http_request_errors_total[5m])) by (instance) >
          sum(rate(vm_http_requests_total[5m])) by (instance) * 0.01
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "{{ $labels.instance }} HTTP error ratio is above 1%"

      # Slow requests (P99 latency, not timeouts)
      - alert: TooSlowQueries
        expr: |
          histogram_quantile(0.99,
            sum(rate(vm_request_duration_seconds_bucket[5m])) by (le, path)
          ) > 30
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "{{ $labels.path }} P99 request latency is above 30 seconds"
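
The DiskRunsOutOfSpaceIn24h rule above relies on predict_linear, which fits a straight line to recent samples and extrapolates it. The sketch below shows that idea with an ordinary least-squares fit. The function name `linear_forecast` and the free-space samples are invented for illustration, and the exact behaviour of a real query engine may differ in details such as which timestamp the forecast starts from.

```python
GIB = 1024 ** 3


def linear_forecast(samples, seconds_ahead):
    """Fit a least-squares line through (timestamp, value) samples and
    evaluate it `seconds_ahead` seconds after the last sample."""
    n = len(samples)
    if n < 2:
        return None  # a line needs at least two points
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    var_t = sum((t - mean_t) ** 2 for t, _ in samples)
    if var_t == 0:
        return None  # all samples share one timestamp
    cov = sum((t - mean_t) * (v - mean_v) for t, v in samples)
    slope = cov / var_t
    intercept = mean_v - slope * mean_t
    return slope * (samples[-1][0] + seconds_ahead) + intercept


# Made-up data: free space falls by 1 GiB every 10 minutes for one hour,
# starting at 100 GiB.
free_space = [(i * 600, (100 - i) * GIB) for i in range(7)]
forecast = linear_forecast(free_space, 24 * 3600)
print(forecast / GIB)
print(forecast < 0)  # the alert condition
```

With these samples the line reaches zero well before 24 hours have passed, so the alert condition holds. A short lookback window such as 1h reacts quickly but is sensitive to bursts (for example a large merge), which is one reason the rule also requires the condition to hold for 30 minutes.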

12.4 Integrating Prometheus monitoring

12.4.1 Scraping VM's own metrics with vmagent

# vmagent scrape configuration
scrape_configs:
  # Single-node VictoriaMetrics
  - job_name: 'victoria-metrics'
    static_configs:
      - targets: ['localhost:8428']
    scrape_interval: 15s

  # Cluster components
  - job_name: 'vminsert'
    static_configs:
      - targets: ['vminsert1:8480', 'vminsert2:8480']

  - job_name: 'vmselect'
    static_configs:
      - targets: ['vmselect1:8481', 'vmselect2:8481']

  - job_name: 'vmstorage'
    static_configs:
      - targets: ['vmstorage1:8482', 'vmstorage2:8482', 'vmstorage3:8482']

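12.4.2 Self-monitoring architecture

┌──────────────────────────────────────────────┐
│                                              │
│  vmagent ──▶ VictoriaMetrics ──▶ Grafana     │
│     │              ▲                         │
│     │              │ scrapes itself          │
│     └──────────────┘                         │
│                                              │
│  vmalert ──▶ Alertmanager                    │
│     │                                        │
│     │ queries                                │
│     └──▶ VictoriaMetrics                     │
└──────────────────────────────────────────────┘

Best practice: use a separate VM instance to monitor the production VM. A monitoring system should not rely solely on itself, because if it goes down, its own metrics and alerts go down with it.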


12.5 Runtime information API

# Build information
curl http://localhost:8428/api/v1/status/buildinfo

# TSDB status
curl http://localhost:8428/api/v1/status/tsdb

# Active queries
curl http://localhost:8428/api/v1/status/active_queries

# Health check
curl http://localhost:8428/health

# Process-level metrics
curl -s http://localhost:8428/metrics | grep "^process_"
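
The health check endpoint above is easy to call from a script, for example in a deployment probe. The following is a minimal sketch using only the Python standard library. The helper names `http_get` and `is_healthy` are hypothetical, and the sketch assumes that an HTTP 200 response from /health is enough to treat the instance as healthy.

```python
import urllib.request


def http_get(url, timeout=5.0):
    """Fetch a URL and return (status_code, body_text)."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.status, resp.read().decode("utf-8", errors="replace")


def is_healthy(base_url, fetch=http_get):
    """Return True if GET <base_url>/health answers with HTTP 200.

    Network errors and non-200 responses count as unhealthy. The `fetch`
    parameter exists so the function can be exercised without a server.
    """
    try:
        status, _ = fetch(base_url.rstrip("/") + "/health")
    except OSError:
        # urllib's URLError and HTTPError, refused connections, and
        # socket timeouts are all subclasses of OSError.
        return False
    return status == 200


# Example call against a local single-node instance:
#   is_healthy("http://localhost:8428")
```

Passing the fetch function as a parameter keeps the check itself free of network side effects, so it can be unit-tested with a stub.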

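Chapter summary

Topic                   Summary
----------------------  ----------------------------------------------------------------
Metrics endpoint        /metrics exposes metrics in Prometheus format
Grafana                 Use the official dashboards (IDs 10229 and 11176)
Alerting rules          Official recommended rules cover ingestion, queries, storage, and system resources
Self-monitoring layout  A separate VM instance is recommended for monitoring production VM
API                     /api/v1/status/* provides runtime information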

Further reading