Chapter 12: Monitoring and Alerting
12.1 Monitoring Metrics
Key metric categories
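| Category | Metric | Alert threshold | Notes |
|---|---|---|---|
| Health | uptime | < 60s | Instance has just restarted |
| Hit rate | hit_rate (derived from get_hits and get_misses) | < 80% | Cache is ineffective |
| Memory | mem_used / mem_limit | > 90% | Memory is close to exhaustion |
| Connections | curr_connections / max_connections | > 80% | Connection count is approaching the limit |
| Evictions | evictions | > 0 and still growing | Items are being evicted because memory is short |
| QPS | rate of cmd_get + cmd_set | Depends on the baseline | Traffic anomaly |
| Latency | P99 response time | > 5ms | Performance degradation |
| Slab | slab_automove | Abnormal | Slab allocation problem |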
12.2 The stats Command in Detail
Basic statistics
echo "stats" | nc localhost 11211
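| Metric | Description | Importance |
|---|---|---|
| pid | Process ID | ★ |
| uptime | Seconds since the server started | ★★★ |
| time | Current Unix timestamp | ★ |
| version | Version string | ★★ |
| libevent | libevent version | ★ |
| pointer_size | Pointer size in bits | ★ |
| rusage_user | CPU time spent in user mode | ★★ |
| rusage_system | CPU time spent in kernel mode | ★★ |
| curr_connections | Currently open connections | ★★★★★ |
| total_connections | Connections opened since startup | ★★ |
| connection_structures | Connection structures allocated | ★★ |
| rejected_connections | Connections rejected | ★★★★ |
| cmd_get | GET requests | ★★★★ |
| cmd_set | SET (storage) requests | ★★★★ |
| cmd_flush | FLUSH requests | ★★★ |
| cmd_touch | TOUCH requests | ★★ |
| get_hits | GET hits | ★★★★★ |
| get_misses | GET misses | ★★★★★ |
| get_expired | GETs that found an expired item | ★★★ |
| get_flushed | GETs that found an item invalidated by a flush | ★★ |
| delete_misses | DELETE misses | ★★ |
| delete_hits | DELETE hits | ★★ |
| incr_misses | INCR misses | ★★ |
| incr_hits | INCR hits | ★★ |
| decr_misses | DECR misses | ★★ |
| decr_hits | DECR hits | ★★ |
| cas_misses | CAS misses | ★★ |
| cas_hits | CAS hits | ★★ |
| cas_badval | CAS requests whose value did not match | ★★ |
| bytes | Bytes currently used to store items | ★★★★★ |
| limit_maxbytes | Maximum memory limit in bytes | ★★★★★ |
| curr_items | Items currently stored | ★★★★ |
| total_items | Items stored since startup | ★★★ |
| evictions | Items evicted to free memory | ★★★★★ |
| bytes_read | Bytes read from the network | ★★ |
| bytes_written | Bytes written to the network | ★★ |
| threads | Worker threads | ★★ |
| hash_power_level | Hash table size as a power of two | ★★ |
| hash_bytes | Bytes used by the hash table | ★★ |
| hash_is_expanding | Whether the hash table is being expanded | ★★ |
| slab_reassign_running | Whether a slab page move is in progress | ★★ |
| slabs_moved | Slab pages moved | ★★ |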
Calculating the hit rate
#!/bin/bash
# Compute the Memcached hit rate from the "stats" output.
# Lines end in CRLF; strip the \r, otherwise the arithmetic below fails.
STATS=$(echo "stats" | nc localhost 11211 | tr -d '\r')
HITS=$(echo "$STATS" | awk '$2 == "get_hits" {print $3}')
MISSES=$(echo "$STATS" | awk '$2 == "get_misses" {print $3}')
TOTAL=$((HITS + MISSES))
if [ "$TOTAL" -gt 0 ]; then
    HIT_RATE=$(echo "scale=2; $HITS * 100 / $TOTAL" | bc)
    echo "Hit rate: ${HIT_RATE}%"
    echo "Hits: $HITS, misses: $MISSES, total: $TOTAL"
else
    echo "No request data yet"
fi
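The counters in `stats` are cumulative since startup, so the hit rate computed from them is a lifetime average, and a QPS figure has to come from the difference between two samples. The following is a minimal sketch of that idea, assuming two snapshots that were already parsed into dicts; the function name `interval_metrics` and the sample numbers are illustrative and not part of Memcached.

```python
def interval_metrics(prev, curr, interval_s):
    """Derive per-interval QPS and hit rate from two "stats" snapshots.

    prev and curr map stat names to values (str or int). Returns None when a
    counter went backwards, which usually means the server restarted.
    """
    if interval_s <= 0:
        raise ValueError("interval_s must be positive")
    keys = ("cmd_get", "cmd_set", "get_hits", "get_misses")
    delta = {k: int(curr[k]) - int(prev[k]) for k in keys}
    if any(v < 0 for v in delta.values()):
        return None
    lookups = delta["get_hits"] + delta["get_misses"]
    return {
        "qps": (delta["cmd_get"] + delta["cmd_set"]) / interval_s,
        # No GETs in the window means the hit rate is undefined, not 0%.
        "hit_rate_pct": delta["get_hits"] * 100 / lookups if lookups else None,
    }


# Illustrative snapshots taken 10 seconds apart (made-up numbers).
before = {"cmd_get": 10000, "cmd_set": 2000, "get_hits": 9000, "get_misses": 1000}
after = {"cmd_get": 10500, "cmd_set": 2100, "get_hits": 9300, "get_misses": 1200}
print(interval_metrics(before, after, 10))
```

With these numbers the lifetime hit rate at the second snapshot is still about 88.6% (9300 of 10500 lookups), while the 10-second window shows 60%, which is why a windowed value is usually the better alerting signal.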
Item statistics
echo "stats items" | nc localhost 11211
# STAT items:1:number 523
# STAT items:1:number_hot 100
# STAT items:1:number_warm 150
# STAT items:1:number_cold 250
# STAT items:1:number_temp 23
# STAT items:1:age 1234
# STAT items:1:evicted 50
# STAT items:1:evicted_nonzero 40
# STAT items:1:evicted_time 300
# STAT items:1:outofmemory 5
# STAT items:1:tailrepairs 10
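Each line of `stats items` has the form `STAT items:<slab class>:<field> <value>`, so the output is easier to work with once it is grouped by slab class. A minimal parsing sketch follows; the helper name `parse_items_stats` and the shortened sample text are assumptions made for illustration.

```python
from collections import defaultdict


def parse_items_stats(raw):
    """Group "STAT items:<class>:<field> <value>" lines by slab class id."""
    per_class = defaultdict(dict)
    for line in raw.splitlines():  # splitlines() also handles CRLF endings
        parts = line.split()
        if len(parts) != 3 or parts[0] != "STAT":
            continue  # skips "END" and anything unexpected
        name_parts = parts[1].split(":")
        if len(name_parts) != 3 or name_parts[0] != "items":
            continue
        _, class_id, field = name_parts
        per_class[int(class_id)][field] = int(parts[2])
    return dict(per_class)


# Shortened, made-up sample in the same format as the output above.
sample = (
    "STAT items:1:number 523\r\n"
    "STAT items:1:evicted 50\r\n"
    "STAT items:2:number 7\r\n"
    "END\r\n"
)
items = parse_items_stats(sample)
print(items[1]["evicted"])
```

Grouping by class makes it straightforward to spot which slab class is evicting, for example by sorting the classes on their `evicted` field.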
Slab statistics
echo "stats slabs" | nc localhost 11211
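The `stats slabs` output reports per-class fields such as `total_chunks` and `used_chunks`, from which a chunk utilization per slab class can be derived. The sketch below assumes that line format (`STAT <class>:<field> <value>`); the function name `slab_utilization` and the sample values are made up for illustration.

```python
def slab_utilization(raw):
    """Return {class_id: used_chunks / total_chunks} from "stats slabs" output."""
    fields = {}
    for line in raw.splitlines():
        parts = line.split()
        # Per-class lines look like "STAT <class>:<field> <value>"; global
        # lines such as "STAT active_slabs 1" have no colon and are skipped.
        if len(parts) == 3 and parts[0] == "STAT" and ":" in parts[1]:
            class_id, field = parts[1].split(":", 1)
            fields.setdefault(int(class_id), {})[field] = parts[2]
    result = {}
    for class_id, f in fields.items():
        total = int(f.get("total_chunks", 0))
        if total > 0:  # avoid dividing by zero for empty classes
            result[class_id] = int(f.get("used_chunks", 0)) / total
    return result


# Made-up sample values in the assumed format.
sample_slabs = (
    "STAT 1:chunk_size 96\r\n"
    "STAT 1:total_chunks 1000\r\n"
    "STAT 1:used_chunks 250\r\n"
    "STAT 2:chunk_size 120\r\n"
    "STAT 2:total_chunks 400\r\n"
    "STAT 2:used_chunks 400\r\n"
    "STAT active_slabs 2\r\n"
    "END\r\n"
)
print(slab_utilization(sample_slabs))
```

A class that sits at full utilization while others are mostly empty is a hint that memory is unevenly distributed across slab classes.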
Settings statistics
echo "stats settings" | nc localhost 11211
# STAT maxbytes 134217728
# STAT maxconns 1024
# STAT tcpport 11211
# STAT udpport 0
# STAT inter 127.0.0.1
# STAT verbosity 0
# STAT oldest 0
# STAT evictions on
# STAT domain_socket NULL
# STAT umask 700
# STAT growth_factor 1.25
# STAT chunk_size 48
# STAT num_threads 4
# STAT num_threads_per_udp 4
# STAT stat_key_prefix :
# STAT detail_enabled no
# STAT reqs_per_event 20
# STAT cas_enabled yes
# STAT tcp_backlog 1024
# STAT binding_protocol auto-negotiate
# STAT auth_enabled_sasl no
# STAT item_size_max 1048576
# STAT maxconns_fast yes
# STAT hashpower_init 0
# STAT slab_reassign yes
# STAT slab_automove 1
# STAT lru_maintainer_thread yes
# STAT lru_crawler no
# STAT lru_crawler_sleep 100
# STAT lru_crawler_tocrawl 0
# STAT tail_repair_time 0
# STAT flush_enabled yes
# STAT dump_flawed no
# STAT hash_algorithm murmur3
12.3 Monitoring with Prometheus + Grafana
Architecture
┌──────────────┐     ┌─────────────────────┐     ┌──────────┐
│  Memcached   │────▶│      Exporter       │────▶│Prometheus│
│    :11211    │stats│ (memcached_exporter)│     │  :9090   │
└──────────────┘     │        :9150        │     └────┬─────┘
                     └─────────────────────┘          │
                                                      ▼
                                                 ┌──────────┐
                                                 │ Grafana  │
                                                 │  :3000   │
                                                 └──────────┘
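Deploying the Memcached Exporter
# With Docker (the exporter container must be able to resolve the hostname "memcached", for example through a shared Docker network)
docker run -d --name memcached-exporter \
  -p 9150:9150 \
  prom/memcached-exporter \
  --memcached.address=memcached:11211
# Or run the release binary
wget https://github.com/prometheus/memcached_exporter/releases/download/v0.14.4/memcached_exporter-0.14.4.linux-amd64.tar.gz
tar xzf memcached_exporter-0.14.4.linux-amd64.tar.gz
# The archive unpacks into a directory named after the release
./memcached_exporter-0.14.4.linux-amd64/memcached_exporter --memcached.address=localhost:11211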
Prometheus configuration
# prometheus.yml
scrape_configs:
  - job_name: 'memcached'
    static_configs:
      - targets:
          - 'mc-exporter1:9150'
          - 'mc-exporter2:9150'
          - 'mc-exporter3:9150'
    scrape_interval: 15s
    scrape_timeout: 10s
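Before relying on the scrape job, it can help to confirm that each exporter actually answers and reports `memcached_up 1`. The sketch below parses the Prometheus text exposition format for a single unlabelled metric; it assumes the exporter serves its metrics at the default `/metrics` path, and the helper names `fetch_metrics` and `read_metric` are illustrative.

```python
from urllib.request import urlopen


def fetch_metrics(url, timeout=5):
    """Download the exposition text, e.g. from http://mc-exporter1:9150/metrics."""
    with urlopen(url, timeout=timeout) as resp:
        return resp.read().decode("utf-8")


def read_metric(exposition_text, name):
    """Return the first sample value of an unlabelled metric, or None."""
    for line in exposition_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        parts = line.split()
        if len(parts) >= 2 and parts[0] == name:
            return float(parts[1])
    return None


# Offline sample so the parsing logic can be tried without a running exporter.
sample_metrics = """# HELP memcached_up Could the memcached server be reached.
# TYPE memcached_up gauge
memcached_up 1
"""
print(read_metric(sample_metrics, "memcached_up"))
```

Against a live exporter the two helpers combine as `read_metric(fetch_metrics("http://mc-exporter1:9150/metrics"), "memcached_up")`, where the host name is taken from the targets listed above.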
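Core exporter metrics
| Prometheus metric | Meaning | Type |
|---|---|---|
| memcached_up | Whether the instance is reachable | gauge |
| memcached_items_total | Items stored since startup | counter |
| memcached_current_bytes | Bytes currently used to store items | gauge |
| memcached_limit_bytes | Memory limit in bytes | gauge |
| memcached_commands_total | Commands processed, labelled by command and status | counter |
| memcached_connections_total | Connections opened since startup | counter |
| memcached_current_items | Items currently stored | gauge |
| memcached_items_evicted_total | Items evicted to free memory | counter |
| memcached_slab_chunk_size_bytes | Chunk size of each slab class | gauge |
| memcached_slab_chunks_free | Free chunks per slab class | gauge |
| memcached_slab_chunks_used | Used chunks per slab class | gauge |
Metric names can differ between exporter versions; check the exporter's /metrics output for the exact names before building queries on them.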
Grafana dashboards
Community-provided templates are a good starting point:
# Import Grafana dashboard ID 11987 (Memcached Overview)
# or ID 2279 (Memcached Full)
Common PromQL queries
# Hit rate (%)
sum(rate(memcached_commands_total{command="get",status="hit"}[5m]))
/
sum(rate(memcached_commands_total{command="get"}[5m]))
* 100
# QPS
sum(rate(memcached_commands_total[5m]))
# Memory usage (%)
memcached_current_bytes / memcached_limit_bytes * 100
# Connection usage (%)
memcached_current_connections / memcached_max_connections * 100
# Eviction rate (items per second)
rate(memcached_items_evicted_total[5m])
# QPS per command
sum by (command) (rate(memcached_commands_total[5m]))
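Queries like these can also be run from a script through the Prometheus HTTP API, whose instant-query endpoint is `/api/v1/query`. The following sketch only builds the request URL and decodes a response body; the helper names `instant_query_url` and `first_value` and the sample response are assumptions for illustration, and no request is sent.

```python
import json
from urllib.parse import urlencode


def instant_query_url(base_url, promql):
    """Build the instant-query URL for a PromQL expression."""
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode({'query': promql})}"


def first_value(body):
    """Return the value of the first series in an instant-query response."""
    payload = json.loads(body)
    if payload.get("status") != "success":
        raise RuntimeError(f"query failed: {payload.get('error', 'unknown error')}")
    result = payload["data"]["result"]
    # Each vector sample is [timestamp, "value-as-string"].
    return float(result[0]["value"][1]) if result else None


# Made-up response in the shape returned for an instant vector.
sample_body = (
    '{"status":"success","data":{"resultType":"vector",'
    '"result":[{"metric":{},"value":[1700000000,"93.5"]}]}}'
)
print(instant_query_url("http://localhost:9090", "memcached_up"))
print(first_value(sample_body))
```

The URL can be fetched with any HTTP client; `urlencode` takes care of escaping operators and braces in longer expressions.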
12.4 Alerting Rules
Prometheus alerting rules (notifications routed by Alertmanager)
# memcached_alerts.yml
groups:
  - name: memcached
    rules:
      # Instance down
      - alert: MemcachedDown
        expr: memcached_up == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Memcached instance down"
          description: "{{ $labels.instance }} has been down for more than 1 minute"
      # Low hit rate
      - alert: MemcachedHitRateLow
        expr: |
          sum(rate(memcached_commands_total{command="get",status="hit"}[5m]))
          / sum(rate(memcached_commands_total{command="get"}[5m]))
          * 100 < 80
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Memcached hit rate below 80%"
          description: "Current hit rate: {{ $value }}%"
      # High memory usage
      - alert: MemcachedMemoryHigh
        expr: memcached_current_bytes / memcached_limit_bytes * 100 > 90
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Memcached memory usage above 90%"
      # Connections approaching the limit
      - alert: MemcachedConnectionsHigh
        expr: memcached_current_connections / memcached_max_connections * 100 > 80
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Memcached connection usage above 80%"
      # Sustained evictions
      - alert: MemcachedEvictions
        expr: rate(memcached_items_evicted_total[5m]) > 10
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Memcached is evicting items continuously"
          description: "Eviction rate: {{ $value }}/s"
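The `for:` clause in each rule means the expression must stay true for the whole duration before the alert fires; a single evaluation below the threshold resets the timer. The sketch below is a simplified model of that behaviour, not Prometheus's actual implementation; the function name `firing_times` and the sample series are illustrative.

```python
def firing_times(samples, threshold, for_seconds):
    """Return the timestamps at which a rule "value > threshold" with a
    "for: for_seconds" clause would be firing.

    samples: list of (timestamp_seconds, value) in ascending time order,
    one entry per rule evaluation.
    """
    firing = []
    pending_since = None
    for ts, value in samples:
        if value > threshold:
            if pending_since is None:
                pending_since = ts  # condition just became true: pending
            if ts - pending_since >= for_seconds:
                firing.append(ts)
        else:
            pending_since = None  # any false evaluation resets the timer
    return firing


# Eviction rate sampled once a minute; the dip at t=120 resets the timer.
series = [(0, 12), (60, 15), (120, 3), (180, 20), (240, 25), (300, 30)]
print(firing_times(series, threshold=10, for_seconds=120))
```

In this series the rate exceeds the threshold from the start, but the alert only fires at t=300, two minutes after the condition became true again at t=180.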
12.5 Custom Monitoring Scripts
Complete monitoring script
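#!/usr/bin/env python3
"""Memcached monitoring script."""
import socket
import sys


def get_stats(host='localhost', port=11211):
    """Send "stats" over the text protocol and return the STAT lines as a dict."""
    data = b''
    with socket.create_connection((host, port), timeout=5) as s:
        s.sendall(b'stats\r\n')
        # The reply ends with "END\r\n". Test the accumulated buffer, because
        # the terminator can be split across two recv() calls.
        while not data.endswith(b'END\r\n'):
            chunk = s.recv(4096)
            if not chunk:  # server closed the connection early
                break
            data += chunk
    stats = {}
    for line in data.decode().split('\r\n'):
        if line.startswith('STAT '):
            parts = line.split(None, 2)
            stats[parts[1]] = parts[2]
    return stats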
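def check_health(stats):
    """Return a list of alert messages for the given stats dict."""
    alerts = []
    # Hit rate (lifetime average since the process started)
    hits = int(stats.get('get_hits', 0))
    misses = int(stats.get('get_misses', 0))
    total = hits + misses
    if total > 0:
        hit_rate = hits / total * 100
        if hit_rate < 80:
            alerts.append(f"Hit rate too low: {hit_rate:.1f}%")
    # Memory usage
    used = int(stats.get('bytes', 0))
    limit = int(stats.get('limit_maxbytes', 0))
    if limit > 0:
        mem_pct = used / limit * 100
        if mem_pct > 90:
            alerts.append(f"Memory usage too high: {mem_pct:.1f}%")
    # Connections. Some server versions do not report max_connections in
    # "stats", so skip the check instead of dividing by a made-up default.
    curr_conn = int(stats.get('curr_connections', 0))
    max_conn = int(stats.get('max_connections', 0))
    if max_conn > 0:
        conn_pct = curr_conn / max_conn * 100
        if conn_pct > 80:
            alerts.append(f"Connection usage too high: {conn_pct:.1f}%")
    # Evictions (cumulative since startup, so this stays set after the first one)
    evictions = int(stats.get('evictions', 0))
    if evictions > 0:
        alerts.append(f"Evictions occurred: {evictions}")
    # Rejected connections (also cumulative)
    rejected = int(stats.get('rejected_connections', 0))
    if rejected > 0:
        alerts.append(f"Connections rejected: {rejected}")
    return alerts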
def print_report(stats):
    """Print a human-readable summary of the stats dict."""
    hits = int(stats.get('get_hits', 0))
    misses = int(stats.get('get_misses', 0))
    total = hits + misses
    hit_rate = (hits / total * 100) if total > 0 else 0

    def ops(name):
        # "stats" has no cmd_delete/cmd_incr/cmd_decr counters, so derive
        # the totals from the corresponding *_hits and *_misses values.
        return int(stats.get(f'{name}_hits', 0)) + int(stats.get(f'{name}_misses', 0))

    print(f"""
Memcached monitoring report
═══════════════════════════════════
Version: {stats.get('version', 'N/A')}
Uptime: {int(stats.get('uptime', 0)) // 3600} hours
Threads: {stats.get('threads', 'N/A')}
━━ Hit rate ━━━━━━━━━━━━━━━━━━━━
Hit rate: {hit_rate:.2f}%
Hits: {hits}
Misses: {misses}
━━ Memory ━━━━━━━━━━━━━━━━━━━━━━
Used: {int(stats.get('bytes', 0)) / 1048576:.1f} MB
Limit: {int(stats.get('limit_maxbytes', 0)) / 1048576:.1f} MB
Usage: {int(stats.get('bytes', 0)) / max(int(stats.get('limit_maxbytes', 1)), 1) * 100:.1f}%
Items: {stats.get('curr_items', 'N/A')}
━━ Traffic ━━━━━━━━━━━━━━━━━━━━━
GET: {stats.get('cmd_get', 'N/A')}
SET: {stats.get('cmd_set', 'N/A')}
DELETE: {ops('delete')}
INCR: {ops('incr')}
DECR: {ops('decr')}
━━ Connections ━━━━━━━━━━━━━━━━━
Current: {stats.get('curr_connections', 'N/A')}
Maximum: {stats.get('max_connections', 'N/A')}
Rejected: {stats.get('rejected_connections', 'N/A')}
━━ Evictions ━━━━━━━━━━━━━━━━━━━
Evictions: {stats.get('evictions', 'N/A')}
""")
if __name__ == '__main__':
    # Usage: script [host] [port]
    host = sys.argv[1] if len(sys.argv) > 1 else 'localhost'
    port = int(sys.argv[2]) if len(sys.argv) > 2 else 11211
    stats = get_stats(host, port)
    print_report(stats)
    alerts = check_health(stats)
    if alerts:
        print("⚠️ Alerts:")
        for a in alerts:
            print(f"  - {a}")
    else:
        print("✅ Status OK")
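When the script is run from cron or a check-based monitoring system, a machine-readable result and a meaningful exit code are usually more useful than a printed report. The sketch below assumes a list of alert strings such as the one returned by `check_health`; the helper name `to_check_result` and the 0/1 exit-code convention are assumptions, not something the original script defines.

```python
import json


def to_check_result(alerts):
    """Turn a list of alert messages into (exit_code, json_text).

    Exit code 0 means healthy and 1 means at least one alert; this mapping
    is a convention chosen for the example.
    """
    status = "warning" if alerts else "ok"
    exit_code = 1 if alerts else 0
    # ensure_ascii=False keeps non-ASCII characters in messages readable.
    text = json.dumps({"status": status, "alerts": alerts}, ensure_ascii=False)
    return exit_code, text


code, body = to_check_result(["Memory usage too high: 93.2%"])
print(code, body)
```

In the main block, `sys.exit(code)` after printing `body` would let the caller react to the result without parsing the report.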
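12.6 Log Analysis
Enabling verbose logging
# Set the log level at startup
memcached -vv # verbose (logs every client command and response)
memcached -vvv # extremely verbose (internal state transitions, for debugging)
# Adjust at runtime
echo "verbosity 2" | nc localhost 11211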
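Log levels
| Level | Flag | Output |
|---|---|---|
| 0 | (none) | Default; essentially silent |
| 1 | -v | Errors and warnings |
| 2 | -vv | Adds client commands and responses |
| 3 | -vvv | Adds internal state transitions |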
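Once command logging is enabled, the log can be summarized to see which commands dominate. The sketch below assumes that incoming client commands are logged as `<fd command ...` and server replies as `>fd ...`; that format, the helper name `count_commands`, and the sample lines are assumptions to verify against the logs of the server version in use.

```python
import re
from collections import Counter

# "<30 get foo": "<" marks client input, 30 is the connection's file descriptor.
COMMAND_LINE = re.compile(r"^<(\d+) (\S+)")

# Only count real protocol verbs, so that lines such as
# "<30 new auto-negotiating client connection" are ignored.
KNOWN_COMMANDS = {
    "get", "gets", "set", "add", "replace", "append", "prepend",
    "cas", "delete", "incr", "decr", "touch",
}


def count_commands(log_lines):
    """Count client commands by verb from verbose log lines."""
    counts = Counter()
    for line in log_lines:
        match = COMMAND_LINE.match(line)
        if match and match.group(2) in KNOWN_COMMANDS:
            counts[match.group(2)] += 1
    return counts


# Made-up log excerpt in the assumed format.
sample_log = [
    "<30 new auto-negotiating client connection",
    "<30 get foo",
    ">30 END",
    "<30 set foo 0 0 3",
    ">30 STORED",
    "<31 get bar",
]
print(count_commands(sample_log).most_common())
```

Because verbose logging at this level records every command, it is best enabled briefly for diagnosis rather than left on in production.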
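Summary
| Key point | Details |
|---|---|
| Core metrics | Hit rate, memory usage, connection count, evictions |
| Hit rate | get_hits / (get_hits + get_misses), kept above 80% |
| Recommended stack | Prometheus + memcached_exporter + Grafana |
| Alert thresholds | Memory > 90%, connections > 80%, hit rate < 80%, evictions > 0 |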