18 - Best Practices

18.1 Metric Naming Conventions

Name format

<namespace>_<subsystem>_<name>_<unit_suffix>

Part	Description	Examples
namespace	Application or component name	`myapp`, `node`, `redis`
subsystem	Subsystem	`http`, `db`, `cache`
name	Metric name	`requests_total`, `duration`
unit_suffix	Unit or type suffix	`_seconds`, `_bytes`, `_total`

Naming rules

Rule	Correct ✅	Incorrect ❌
Use snake_case	`http_requests_total`	`httpRequestsTotal`
Counters end in `_total`	`errors_total`	`errors`
Use base units	`_seconds`	`_milliseconds`
Use base units	`_bytes`	`_megabytes`
Avoid redundancy	`app_http_requests`	`app_http_app_requests`
Use descriptive names	`db_connections_active`	`db_conns`
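
The rules in this table are mechanical enough to check in code, for example in a unit test or a CI lint step. The following is a minimal sketch using only the Go standard library; `CheckMetricName` and the `nonBaseUnits` list are hypothetical names introduced for illustration, and the list of non-base units is deliberately incomplete.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// snakeCase matches lowercase snake_case names such as "http_requests_total".
var snakeCase = regexp.MustCompile(`^[a-z][a-z0-9]*(_[a-z0-9]+)*$`)

// nonBaseUnits lists a few suffixes that indicate a non-base unit.
// Illustrative only, not exhaustive.
var nonBaseUnits = []string{"_milliseconds", "_microseconds", "_kilobytes", "_megabytes"}

// CheckMetricName returns the naming-rule violations found in name.
// isCounter tells the checker whether the "_total" suffix is expected.
func CheckMetricName(name string, isCounter bool) []string {
	var problems []string
	if !snakeCase.MatchString(name) {
		problems = append(problems, "not snake_case")
	}
	if isCounter && !strings.HasSuffix(name, "_total") {
		problems = append(problems, "counter without _total suffix")
	}
	for _, u := range nonBaseUnits {
		if strings.Contains(name, u) {
			problems = append(problems, "non-base unit "+u)
		}
	}
	return problems
}

func main() {
	fmt.Println(CheckMetricName("http_requests_total", true)) // []
	fmt.Println(CheckMetricName("httpRequestsTotal", true))
	fmt.Println(CheckMetricName("request_duration_milliseconds", false))
}
```

A check like this only covers the mechanical rules; whether a name is descriptive or redundant still needs human review.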

Common unit suffixes

Type	Suffix	Example
Time	`_seconds`	`http_duration_seconds`
Bytes	`_bytes`	`response_size_bytes`
Ratio	`_ratio`	`cache_hit_ratio`
Count (counters)	`_total`	`requests_total`
Temperature	`_celsius`	`cpu_temperature_celsius`

Naming examples

import "github.com/prometheus/client_golang/prometheus"

// ✅ Recommended
var (
    httpRequestsTotal = prometheus.NewCounterVec(
        prometheus.CounterOpts{
            Namespace: "myapp",
            Subsystem: "http",
            Name:      "requests_total",
            Help:      "Total HTTP requests",
        },
        []string{"method", "path", "status"},
    )
    httpRequestDuration = prometheus.NewHistogramVec(
        prometheus.HistogramOpts{
            Namespace: "myapp",
            Subsystem: "http",
            Name:      "request_duration_seconds",
            Help:      "HTTP request duration",
            Buckets:   prometheus.DefBuckets,
        },
        []string{"method", "path"},
    )
)

// ❌ Not recommended
var (
    reqCount = prometheus.NewCounter(
        prometheus.CounterOpts{
            Name: "req_count", // no namespace/subsystem, abbreviated, and no _total suffix
            Help: "Request count",
        },
    )
)

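18.2 Label Design Principles

Controlling label cardinality

A label's cardinality is the number of distinct values it can take. Every distinct combination of label values creates a separate time series, so high cardinality is one of the most common causes of Prometheus performance problems.

Cardinality level	Example	Risk
Low (✅ safe)	`method: GET/POST/PUT/DELETE`	None
Medium (⚠️ caution)	`status: 200/201/301/400/404/500`	Manageable
High (❌ dangerous)	`user_id: 0-1000000`	Memory blow-up
Very high (🚫 forbidden)	`request_id: UUID`	Catastrophic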

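Labels to avoid

// ❌ Never use these as labels
user_id          // millions of users
request_id       // unique per request
trace_id         // distributed tracing ID
session_id       // session ID
ip_address       // IP address
timestamp        // timestamp
error_message    // free-text error message
url_full         // full URL (unbounded combinations)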

Label design examples

// ✅ Recommended label design
var httpRequests = prometheus.NewCounterVec(
    prometheus.CounterOpts{
        Name: "http_requests_total",
        Help: "Total HTTP requests",
    },
    []string{
        "method",  // GET/POST/PUT/DELETE (4)
        "path",    // route template such as /api/users, /api/orders (bounded)
        "status",  // 200/201/400/404/500 (bounded)
        "handler", // handler function name (bounded)
    },
)

// ❌ Dangerous label design
var httpRequestsUnbounded = prometheus.NewCounterVec(
    prometheus.CounterOpts{
        Name: "http_requests_total",
        Help: "Total HTTP requests",
    },
    []string{
        "method",
        "url",     // full URL, potentially unbounded combinations
        "user_id", // user ID, millions of values
        "ip",      // IP address, millions of values
    },
)
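
One way to keep a `path` label bounded is to normalize the raw request path before using it as a label value. The sketch below is an assumption about how that might look, not part of any Prometheus library: `NormalizePath` is a hypothetical helper that drops the query string and replaces segments that look like numeric IDs or UUIDs with `:id`. If your router exposes the matched route template, using that directly is usually simpler.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// idSegment matches path segments that look like numeric IDs or UUIDs.
var idSegment = regexp.MustCompile(
	`^([0-9]+|[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12})$`)

// NormalizePath drops the query string and replaces ID-like segments
// with ":id" so that the result has a bounded number of distinct values.
func NormalizePath(rawPath string) string {
	if i := strings.IndexByte(rawPath, '?'); i >= 0 {
		rawPath = rawPath[:i]
	}
	segments := strings.Split(rawPath, "/")
	for i, s := range segments {
		if idSegment.MatchString(s) {
			segments[i] = ":id"
		}
	}
	return strings.Join(segments, "/")
}

func main() {
	fmt.Println(NormalizePath("/api/users/12345?tab=orders")) // /api/users/:id
	fmt.Println(NormalizePath("/api/orders"))                 // /api/orders
}
```

The normalized value would then be passed as the `path` label, for example via `WithLabelValues`, instead of the raw URL.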

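Label naming conventions

Rule	Correct ✅	Incorrect ❌
Use snake_case	`http_method`	`httpMethod`
Avoid the `__` prefix	`env`	`__env` (reserved for internal use)
Use consistent enumerated values	`method="GET"`	`method="get"` (inconsistent case)
Avoid empty values	Omit the label	`path=""` (Prometheus treats an empty value the same as a missing label)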

Label design for a business scenario

// Metric label design for an e-commerce system
var orderMetrics = prometheus.NewCounterVec(
    prometheus.CounterOpts{
        Name: "orders_created_total",
        Help: "Total orders created",
    },
    []string{
        "channel",        // web/app/miniapp (3)
        "payment_method", // alipay/wechat/card (3)
        "region",         // cn-east/cn-south/cn-north (3-5)
        "order_type",     // normal/refund (2)
        // Combinations with 3 regions: 3 × 3 × 3 × 2 = 54 (safe); with 5 regions: 90
    },
)
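
The worst-case series count for a metric is the product of the number of values each label can take. A small sketch of that arithmetic follows; `MaxSeries` is a hypothetical helper, and the `user_id` figure is the one-million-user example used earlier in this section.

```go
package main

import "fmt"

// MaxSeries returns the worst-case number of time series for one metric:
// the product of the number of distinct values of each label.
func MaxSeries(valuesPerLabel map[string]int) int {
	total := 1
	for _, n := range valuesPerLabel {
		total *= n
	}
	return total
}

func main() {
	orders := map[string]int{
		"channel":        3,
		"payment_method": 3,
		"region":         3,
		"order_type":     2,
	}
	fmt.Println(MaxSeries(orders)) // 54

	// Adding a single high-cardinality label multiplies everything.
	orders["user_id"] = 1000000
	fmt.Println(MaxSeries(orders)) // 54000000
}
```

In practice not every combination occurs, so the product is an upper bound, but it shows how one unbounded label dominates the total.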

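18.3 Scrape Interval Design

Choosing an interval

Scenario	Recommended interval	Notes
Key business metrics	15s	Near-real-time monitoring
Infrastructure	15-30s	CPU/memory/disk
Non-critical metrics	60s	Exporter status
Batch job metrics	1-2m	Pushgateway; keep the interval well below the default 5m query lookback delta, otherwise series appear stale between scrapes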

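Interval and rate window

Scrape interval: 15s

Suggested rate() window:
  Minimum: 30s (2x), the smallest window that can contain two samples
  Recommended: 60s (4x), tolerates a missed scrape
  Maximum: depends on your needs

Suggested irate() window:
  Minimum: 30s (2x)
  Recommended: 30-60s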
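
The multipliers above can be expressed as a small helper, which is handy when rules are generated per job with different scrape intervals. This is a sketch; `RateWindows` is a hypothetical name, and the 2x/4x factors are simply the ones recommended in this section.

```go
package main

import (
	"fmt"
	"time"
)

// RateWindows returns the minimum (2x) and recommended (4x) range-vector
// window for rate() given a scrape interval.
func RateWindows(scrapeInterval time.Duration) (minimum, recommended time.Duration) {
	return 2 * scrapeInterval, 4 * scrapeInterval
}

func main() {
	minimum, recommended := RateWindows(15 * time.Second)
	fmt.Println(minimum, recommended) // 30s 1m0s
}
```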

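18.4 Capacity Planning

Estimating the number of time series

Total series = Σ (series per exporter × number of instances)

Example:
  Node Exporter: 500 series × 100 hosts = 50,000
  MySQL Exporter: 200 series × 10 hosts = 2,000
  Application metrics: 100 series × 50 instances = 5,000
  ─────────────────────────────────
  Total: 57,000 series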

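Estimating storage

Daily storage = series × (86,400 / scrape interval in seconds) × bytes per sample

Prometheus compresses samples on disk to roughly 1-2 bytes each, so the per-sample figure below already includes compression. An uncompressed sample (8-byte timestamp plus 8-byte value) would take 16 bytes.

Example (57,000 series, 15s interval, 2 bytes/sample):
  Samples per series per day: 86,400 / 15 = 5,760
  Daily: 57,000 × 5,760 × 2 ≈ 657 MB/day (on disk, compressed)
  At 1 byte/sample: ≈ 328 MB/day
  15-day retention: ≈ 5-10 GB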
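
The storage formula is easy to turn into a calculator. The sketch below uses the same inputs as the example (57,000 series, 15s interval, 2 bytes per sample); `DailyBytes` and `RetentionBytes` are hypothetical helper names, and the bytes-per-sample value is an assumption you should replace with a figure measured on your own server.

```go
package main

import (
	"fmt"
	"time"
)

// DailyBytes estimates the bytes written per day:
// series × samples per day × bytes per sample.
func DailyBytes(series int, scrapeInterval time.Duration, bytesPerSample float64) float64 {
	samplesPerDay := float64(24*time.Hour) / float64(scrapeInterval)
	return float64(series) * samplesPerDay * bytesPerSample
}

// RetentionBytes scales a daily figure to a retention period in days.
func RetentionBytes(dailyBytes float64, days int) float64 {
	return dailyBytes * float64(days)
}

func main() {
	daily := DailyBytes(57000, 15*time.Second, 2)
	fmt.Printf("daily: %.0f MB\n", daily/1e6)                        // daily: 657 MB
	fmt.Printf("15 days: %.1f GB\n", RetentionBytes(daily, 15)/1e9) // 15 days: 9.8 GB
}
```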

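Estimating memory

Memory ≈ active series × 8-16 KB (a rough rule of thumb; actual usage depends on label sizes, series churn, and query load)

Example (57,000 series):
  Low estimate: 57,000 × 8 KB ≈ 445 MiB
  High estimate: 57,000 × 16 KB ≈ 891 MiB
  Recommended: 2-4 GB (leaves headroom)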
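
The same rule of thumb as a function, assuming 1 KB = 1,024 bytes as in the example above. `MemoryRangeMiB` is a hypothetical helper, and the 8-16 KB per series range is this chapter's estimate rather than a figure guaranteed by Prometheus.

```go
package main

import "fmt"

// MemoryRangeMiB returns a low and high memory estimate in MiB,
// using 8 KiB and 16 KiB per active series respectively.
func MemoryRangeMiB(activeSeries int) (low, high float64) {
	const kibPerMiB = 1024.0
	return float64(activeSeries) * 8 / kibPerMiB, float64(activeSeries) * 16 / kibPerMiB
}

func main() {
	low, high := MemoryRangeMiB(57000)
	fmt.Printf("%.0f-%.0f MiB\n", low, high) // 445-891 MiB
}
```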

Capacity planning table

Series	Suggested memory	Suggested disk (15 days)	Suggested CPU
< 10K	2 GB	10 GB	1 core
10K - 100K	4-8 GB	50 GB	2 cores
100K - 500K	16-32 GB	200 GB	4 cores
500K - 1M	32-64 GB	500 GB	8 cores
> 1M	Consider sharding/Thanos	Object storage	8+ cores

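18.5 High Availability Design

Redundant pair architecture

Two Prometheus instances with identical configuration scrape the same targets independently. A load balancer spreads queries across them, and both send alerts to Alertmanager, which deduplicates them.

                ┌───────────────┐
                │ Load Balancer │
                └───────┬───────┘
                        │
             ┌──────────┴──────────┐
             ▼                     ▼
    ┌─────────────────┐   ┌─────────────────┐
    │  Prometheus A   │   │  Prometheus B   │
    │  (same config)  │   │  (same config)  │
    └────────┬────────┘   └────────┬────────┘
             │                     │
             └──────────┬──────────┘
                        ▼
               ┌─────────────────┐
               │  Alertmanager   │
               │ (deduplicates)  │
               └─────────────────┘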
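
To see why identical configuration matters for deduplication, consider a simplified model in which an alert's identity is its label set. The sketch below is not Alertmanager's implementation; `Alert`, `fingerprint`, and `Deduplicate` are hypothetical names. It also shows why a per-replica external label (if you use one) should be dropped before alerts are sent, for example with `alert_relabel_configs`: a differing label would make the two copies look like different alerts.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// Alert is a simplified alert: a label set plus the replica that sent it.
type Alert struct {
	Labels map[string]string
	Source string // sending replica; deliberately not part of the identity
}

// fingerprint builds an order-independent key from the label set.
// A real implementation would hash and escape values; this is a sketch.
func fingerprint(labels map[string]string) string {
	keys := make([]string, 0, len(labels))
	for k := range labels {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	var b strings.Builder
	for _, k := range keys {
		b.WriteString(k + "=" + labels[k] + ";")
	}
	return b.String()
}

// Deduplicate keeps the first alert seen for each distinct label set.
func Deduplicate(alerts []Alert) []Alert {
	seen := make(map[string]bool)
	var out []Alert
	for _, a := range alerts {
		fp := fingerprint(a.Labels)
		if seen[fp] {
			continue
		}
		seen[fp] = true
		out = append(out, a)
	}
	return out
}

func main() {
	labels := map[string]string{"alertname": "HighErrorRate", "job": "api"}
	alerts := []Alert{
		{Labels: labels, Source: "prometheus-a"},
		{Labels: labels, Source: "prometheus-b"},
	}
	fmt.Println(len(Deduplicate(alerts))) // 1
}
```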

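Thanos Sidecar architecture (recommended)

Each Sidecar uploads completed TSDB blocks to object storage, and Thanos Query deduplicates across replicas at query time.

Prometheus A + Sidecar ──┐
                         ├──► Thanos Query (deduplication + global view)
Prometheus B + Sidecar ──┘
                         │
                         ▼
                   Object storage (long-term)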

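18.6 Alerting Best Practices

Alert severity levels

Level	Meaning	Response time	Notification channel
critical	Service unavailable	Immediate	Phone call + SMS
warning	Service degraded	30 minutes	IM + email
info	Informational	Next business day	Email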

Writing alert rules

# ✅ A good alert rule
- alert: HighErrorRate
  expr: |
    sum by(job) (rate(http_requests_total{status=~"5.."}[5m]))
    / sum by(job) (rate(http_requests_total[5m]))
    > 0.05
  for: 5m
  labels:
    severity: critical
  annotations:
    summary: "{{ $labels.job }} error rate is {{ $value | humanizePercentage }}"

# ❌ A poor alert rule
- alert: HighErrorRate
  expr: http_requests_total{status="500"} > 100  # a raw counter value says nothing about the current error rate
  for: 0s  # no hold duration, prone to false positives
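Avoiding alert storms

# 1. Use `for` to filter out transient alerts
for: 5m

# 2. Use inhibition rules
#    (source_matchers/target_matchers replace the deprecated
#    source_match/target_match fields in current Alertmanager releases)
inhibit_rules:
  - source_matchers: [severity="critical"]
    target_matchers: [severity="warning"]
    equal: [alertname, instance]

# 3. Use grouping to merge related alerts
route:
  group_by: [alertname, cluster]
  group_wait: 30s
  group_interval: 5m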
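
The inhibition rule above reads as: a firing critical alert mutes a warning alert when both share the same `alertname` and `instance`. A minimal sketch of that matching logic follows. It supports only exact-equality matchers and is an illustration of the semantics, not Alertmanager's code; `InhibitRule` and `Inhibits` are hypothetical names.

```go
package main

import "fmt"

// Alert is a simplified alert represented by its label set.
type Alert map[string]string

// InhibitRule mutes target alerts while a matching source alert is firing.
type InhibitRule struct {
	Source map[string]string // labels the inhibiting alert must carry
	Target map[string]string // labels the alert to be muted must carry
	Equal  []string          // labels that must have equal values in both
}

// matches reports whether the alert carries every wanted label value.
func matches(a Alert, want map[string]string) bool {
	for k, v := range want {
		if a[k] != v {
			return false
		}
	}
	return true
}

// Inhibits reports whether source mutes target under this rule.
func (r InhibitRule) Inhibits(source, target Alert) bool {
	if !matches(source, r.Source) || !matches(target, r.Target) {
		return false
	}
	for _, label := range r.Equal {
		if source[label] != target[label] {
			return false
		}
	}
	return true
}

func main() {
	rule := InhibitRule{
		Source: map[string]string{"severity": "critical"},
		Target: map[string]string{"severity": "warning"},
		Equal:  []string{"alertname", "instance"},
	}
	critical := Alert{"alertname": "HighErrorRate", "instance": "api-1", "severity": "critical"}
	warning := Alert{"alertname": "HighErrorRate", "instance": "api-1", "severity": "warning"}
	fmt.Println(rule.Inhibits(critical, warning)) // true
}
```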

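18.7 Security Best Practices

Access control

# Enable TLS and basic auth
# web.yml (passed to Prometheus with --web.config.file)
tls_server_config:
  cert_file: /etc/prometheus/tls/server.crt
  key_file: /etc/prometheus/tls/server.key
basic_auth_users:
  admin: $2y$10$hash  # bcrypt hash of the password, never the plaintext

# Restrict network access (command-line flag): listen on loopback only
--web.listen-address=127.0.0.1:9090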

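Managing secrets

# ❌ Do not hardcode passwords in the configuration
basic_auth:
  username: admin
  password: secret123

# ✅ Use a password file
basic_auth:
  username: admin
  password_file: /etc/prometheus/auth/password

# ⚠️ Environment variables: Prometheus does not expand ${VAR} in its
# configuration file, so a template like this must be rendered by an
# external tool (for example envsubst) before Prometheus starts
basic_auth:
  username: ${PROM_USER}
  password: ${PROM_PASSWORD}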
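
Prometheus itself does not substitute environment variables in its configuration file, so a `${VAR}` template has to be rendered before the server starts. The sketch below shows one way to do that with the standard library's `os.Expand`; `RenderConfig` is a hypothetical helper, and the demo values stand in for real environment variables. Note that `os.Expand` also treats a bare `$name` as a variable reference, so a template containing literal dollar signs (such as a bcrypt hash) would need separate handling.

```go
package main

import (
	"fmt"
	"os"
)

// RenderConfig substitutes ${VAR} references in a config template.
// lookup returns the value and whether the variable is set; unset
// variables are reported as an error instead of silently becoming empty.
func RenderConfig(template string, lookup func(string) (string, bool)) (string, error) {
	var missing []string
	out := os.Expand(template, func(name string) string {
		v, ok := lookup(name)
		if !ok {
			missing = append(missing, name)
		}
		return v
	})
	if len(missing) > 0 {
		return "", fmt.Errorf("unset variables: %v", missing)
	}
	return out, nil
}

func main() {
	// Demo values; in real use, pass os.LookupEnv as the lookup function.
	demo := map[string]string{"PROM_USER": "admin", "PROM_PASSWORD": "example-only"}
	lookup := func(name string) (string, bool) {
		v, ok := demo[name]
		return v, ok
	}

	tmpl := "basic_auth:\n  username: ${PROM_USER}\n  password: ${PROM_PASSWORD}\n"
	rendered, err := RenderConfig(tmpl, lookup)
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Print(rendered)
}
```

The rendered file still contains the secret in plaintext, so it needs the same file permissions as a password file; where possible, `password_file` avoids the rendering step entirely.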

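18.8 Operations Checklist

Daily checks

Prometheus service is healthy
All targets are UP
Enough free disk space (more than 30% free)
Memory usage is normal
No rule evaluation errors
Alertmanager is running normally

Periodic checks

Review high-cardinality metrics
Evaluate the retention policy
Verify that alert rules are still effective
Update exporter versions
Back up configuration and rules
Reassess capacity planning

Upgrade checks

Read the release notes
Validate in a test environment
Back up the data directory
Prepare a rollback plan
Monitor performance after the upgrade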

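18.9 Production Deployment Checklist

Item	What to check
Configuration	`promtool check config <config-file>`
Rules	`promtool check rules <rule-files>`
Retention	Set `--storage.tsdb.retention.time` or `--storage.tsdb.retention.size`
Storage	Dedicated disk/SSD
TLS	Enable HTTPS
Authentication	Enable basic auth
High availability	Two instances + deduplication
Alerting	Alertmanager cluster
Monitoring	Monitor Prometheus itself
Backup	Back up configuration and rules regularly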

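18.10 Chapter Summary

Area	Best practice
Naming	`<namespace>_<subsystem>_<name>_<unit>`
Labels	Control cardinality; avoid user_id/request_id
Interval	15-30s; rate window ≥ 4x the interval
Capacity	Estimate the series count; reserve 20-30% headroom
High availability	Redundant Prometheus pair + Thanos deduplication
Alerting	Severity levels, `for`, inhibition, grouping
Security	TLS + authentication + network isolation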

🎉 Congratulations! You have completed all 18 chapters of Prometheus 完全指南 (The Complete Guide to Prometheus).
