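Chapter 11: Observability

You cannot manage what you cannot see. In a microservice system, observability is not a nice-to-have; it is critical infrastructure.

11.1 The Three Pillars of Observability

11.1.1 Concepts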

┌──────────────────────────────────────────────────────────────┐
│              The Three Pillars of Observability              │
├──────────────────────────────────────────────────────────────┤
│                                                              │
│            ┌─────────────────────────────────────┐           │
│            │            Observability            │           │
│            └──────────────────┬──────────────────┘           │
│            ┌──────────────────┼──────────────────┐           │
│            ▼                  ▼                  ▼           │
│   ┌────────────────┐ ┌────────────────┐ ┌────────────────┐   │
│   │ Metrics        │ │ Logs           │ │ Traces         │   │
│   ├────────────────┤ ├────────────────┤ ├────────────────┤   │
│   │ "Is the system │ │ "What          │ │ "Where did the │   │
│   │  healthy?"     │ │  happened?"    │ │  request go?"  │   │
│   ├────────────────┤ ├────────────────┤ ├────────────────┤   │
│   │ Time series    │ │ Text records   │ │ Request paths  │   │
│   │ Aggregates     │ │ Structured     │ │ Span timing    │   │
│   ├────────────────┤ ├────────────────┤ ├────────────────┤   │
│   │ Prometheus     │ │ ELK / Loki     │ │ Jaeger         │   │
│   │ Grafana        │ │                │ │ Zipkin         │   │
│   └────────────────┘ └────────────────┘ └────────────────┘   │
│                                                              │
│  How they relate:                                            │
│  Metrics tell you that something is wrong                    │
│  → Logs tell you what went wrong                             │
│  → Traces tell you where it went wrong                       │
└──────────────────────────────────────────────────────────────┘

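11.1.2 Comparison

Dimension	Metrics	Logs	Traces
Data type	Numeric	Text	Structured (spans)
Storage	Time-series database	Document store	Trace store
Query style	Aggregation	Full-text search	Trace lookup
Typical tools	Prometheus	ELK/Loki	Jaeger/Zipkin
Focus	Trends, alerting	Detail, troubleshooting	Call chains, bottlenecks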

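11.2 Metrics

11.2.1 Metric Types

Type	Description	Examples
Counter	A count that only ever increases	Total HTTP requests, total errors
Gauge	A point-in-time value that can rise or fall	Memory in use, CPU utilization, queue length
Histogram	Distribution of observations, counted in buckets	Request latency distribution
Summary	Quantiles computed on the client side	P50/P95/P99 latency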

  Metric types at a glance:

  Counter (cumulative)        Gauge (instantaneous)
  ─────────────               ─────────────
  ▲  requests                 ▲  memory
  │      /                    │  /\/\/\
  │     /                     │ /      \
  │    /                      │/        \
  │   /                       │
  └──────────▶ t              └──────────▶ t
  (only increases)            (rises and falls)

  Histogram (distribution)    Summary (quantiles)
  ─────────────               ─────────────
  ▲  latency                  ▲  P99=500ms
  │  ██                       │  ────────
  │  ████                     │  P95=200ms
  │  ██████                   │  ────────
  │  ████████                 │  P50=50ms
  └──────────▶ ms             └──────────
  (bucketed counts)           (percentiles)
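
To make the difference between the types concrete, here is a minimal pure-Python sketch of a Counter, a Gauge, and a Histogram. It is not the Prometheus client library; the class names, the bucket bounds, and the sample latencies are assumptions chosen for illustration. Summary is left out because it needs a streaming quantile estimator.

```python
import bisect


class Counter:
    """A count that can only go up."""

    def __init__(self):
        self.value = 0.0

    def inc(self, amount=1.0):
        if amount < 0:
            raise ValueError("a counter can only increase")
        self.value += amount


class Gauge:
    """A point-in-time value that may rise or fall."""

    def __init__(self):
        self.value = 0.0

    def set(self, value):
        self.value = value


class Histogram:
    """Counts observations into buckets and reports them cumulatively."""

    def __init__(self, bounds):
        self.bounds = sorted(bounds)                 # upper bounds ("le")
        self.counts = [0] * (len(self.bounds) + 1)   # last slot is +Inf
        self.total = 0.0

    def observe(self, value):
        # bisect_left returns the first bucket whose upper bound is >= value
        self.counts[bisect.bisect_left(self.bounds, value)] += 1
        self.total += value

    def cumulative(self):
        out, running = [], 0
        for bound, count in zip(self.bounds + [float("inf")], self.counts):
            running += count
            out.append((bound, running))
        return out


# Illustrative data: five requests with made-up latencies in seconds.
requests = Counter()
in_flight = Gauge()
latency = Histogram([0.05, 0.2, 0.5])
for seconds in (0.03, 0.05, 0.12, 0.4, 0.9):
    requests.inc()
    latency.observe(seconds)
in_flight.set(3)
in_flight.set(1)   # a gauge may go back down

print(requests.value, in_flight.value, latency.cumulative())
```

The cumulative bucket list mirrors how a histogram is exposed for scraping: each bucket counts every observation at or below its upper bound, and the +Inf bucket equals the total count.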

11.2.2 Prometheus Architecture

┌──────────────────────────────────────────────────────────────┐
│                   Prometheus architecture                    │
├──────────────────────────────────────────────────────────────┤
│                                                              │
│  ┌──────────┐  ┌──────────┐  ┌──────────┐                    │
│  │ Service A│  │ Service B│  │ Service C│  expose /metrics   │
│  │ /metrics │  │ /metrics │  │ /metrics │                    │
│  └────┬─────┘  └────┬─────┘  └────┬─────┘                    │
│       │             │             │                          │
│       └─────────────┼─────────────┘                          │
│                     ▼                                        │
│            ┌──────────────────┐                              │
│            │   Prometheus     │                              │
│            │   Server         │                              │
│            │                  │                              │
│            │  ┌────────────┐  │                              │
│            │  │  TSDB      │  │  time-series database        │
│            │  │ (local)    │  │                              │
│            │  └────────────┘  │                              │
│            │                  │                              │
│            │  ┌────────────┐  │                              │
│            │  │  PromQL    │  │  query language              │
│            │  └────────────┘  │                              │
│            │                  │                              │
│            │  ┌────────────┐  │   ┌──────────────────────┐   │
│            │  │ Alert rules│─────▶│ Alertmanager         │   │
│            │  └────────────┘  │   │ email/DingTalk/Slack │   │
│            └────────┬─────────┘   └──────────────────────┘   │
│                     │                                        │
│                     ▼                                        │
│            ┌──────────────────┐                              │
│            │    Grafana       │  dashboards                  │
│            └──────────────────┘                              │
└──────────────────────────────────────────────────────────────┘

Note that Alertmanager runs as a separate component: the Prometheus server evaluates the alert rules and forwards firing alerts to it, and Alertmanager handles grouping and notification.

11.2.3 Prometheus Configuration

# prometheus.yml
global:
  scrape_interval: 15s        # how often targets are scraped
  evaluation_interval: 15s    # how often rules are evaluated

scrape_configs:
  - job_name: 'user-service'
    metrics_path: '/actuator/prometheus'  # Spring Boot Actuator
    static_configs:
      - targets: ['user-service:8080']
        labels:
          service: user-service
          env: production

  - job_name: 'kubernetes-pods'
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true

alerting:
  alertmanagers:
    - static_configs:
        - targets: ['alertmanager:9093']

rule_files:
  - "alert_rules.yml"

11.2.4 Common PromQL Queries

# HTTP request rate (QPS), one result per series
rate(http_requests_total[5m])

# Error rate (%): aggregate both sides first so their label sets match
sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m])) * 100

# P99 latency
histogram_quantile(0.99, sum by (le) (rate(http_request_duration_seconds_bucket[5m])))

# Resident memory (MiB)
process_resident_memory_bytes / 1024 / 1024

# Request rate per service
sum by (service) (rate(http_requests_total[5m]))
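
The P99 query relies on histogram_quantile, which estimates a quantile from cumulative bucket counts by linear interpolation inside the bucket that contains the target rank. The following is a simplified Python sketch of that idea, not the Prometheus implementation; the function name mirrors the PromQL function, and the bucket data is an assumption.

```python
def histogram_quantile(q, buckets):
    """Estimate the q-quantile from (upper_bound, cumulative_count) pairs.

    The list must end with the +Inf bucket. Observations are assumed to be
    spread evenly inside each bucket, which is why the result is an estimate.
    """
    buckets = sorted(buckets)
    total = buckets[-1][1]
    if total == 0:
        return float("nan")
    rank = q * total
    prev_bound, prev_count = 0.0, 0
    for bound, count in buckets:
        if count >= rank and count > prev_count:
            if bound == float("inf"):
                # Cannot interpolate into the open-ended bucket:
                # fall back to the highest finite upper bound.
                return prev_bound
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return prev_bound


# Illustrative buckets: upper bounds in seconds with cumulative counts.
example = [(0.05, 2), (0.2, 3), (0.5, 4), (float("inf"), 5)]
print(histogram_quantile(0.5, example))
print(histogram_quantile(0.99, example))
```

With these buckets the P99 falls into the +Inf bucket, so the estimate is capped at 0.5s. This is the practical reason to choose bucket bounds that extend past the latency you want to alert on.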

11.2.5 Alert Rules

# alert_rules.yml
groups:
  - name: microservice_alerts
    rules:
      # Error rate above 5% (aggregated per service so both sides match)
      - alert: HighErrorRate
        expr: |
          sum by (service) (rate(http_requests_total{status=~"5.."}[5m]))
          / sum by (service) (rate(http_requests_total[5m])) > 0.05
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Service {{ $labels.service }} has a high error rate"
          description: "Error rate: {{ $value | humanizePercentage }}"

      # P99 latency above 500ms
      - alert: HighLatency
        expr: |
          histogram_quantile(0.99,
            sum by (le, service) (rate(http_request_duration_seconds_bucket[5m]))) > 0.5
        for: 3m
        labels:
          severity: warning
        annotations:
          summary: "Service {{ $labels.service }} has high P99 latency"
          description: "P99 latency: {{ $value | humanizeDuration }}"

      # A scrape target has been unreachable for 1 minute
      - alert: ServiceDown
        expr: up{job=~".*-service"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Service {{ $labels.job }} is unavailable"
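
The `for:` clause in these rules means a condition must hold continuously for the given duration before the alert fires; until then the alert is only pending. Here is a minimal Python sketch of that state logic. The function name, the 60-second evaluation spacing, and the sample values are assumptions for illustration.

```python
def alert_states(samples, threshold, for_seconds):
    """Return inactive/pending/firing for each (timestamp, value) evaluation.

    samples must be in time order, one entry per rule evaluation.
    """
    states = []
    active_since = None
    for timestamp, value in samples:
        if value > threshold:
            if active_since is None:
                active_since = timestamp      # condition just became true
            held_for = timestamp - active_since
            states.append("firing" if held_for >= for_seconds else "pending")
        else:
            active_since = None               # any dip resets the timer
            states.append("inactive")
    return states


# Hypothetical error ratios evaluated every 60 seconds.
evaluations = [(0, 0.02), (60, 0.08), (120, 0.09), (180, 0.07), (240, 0.01)]
print(alert_states(evaluations, threshold=0.05, for_seconds=120))
```

A short spike that lasts a single evaluation never gets past pending, which is how `for:` filters out transient noise at the cost of a slower first notification.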

11.3 Log Aggregation (Logging)

11.3.1 Structured Logs

  ❌ Unstructured log (hard to search and analyze)
  2026-05-10 10:30:00 INFO OrderService - User Zhang San created an order, amount 299 CNY

  ✅ Structured log (JSON, easy to search and aggregate)
  {
    "timestamp": "2026-05-10T10:30:00+08:00",
    "level": "INFO",
    "service": "order-service",
    "traceId": "abc-123-def",
    "spanId": "span-001",
    "userId": "user-001",
    "orderId": "ORD-20260510-001",
    "amount": 299.00,
    "message": "Order created successfully"
  }
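
As a rough illustration of how an application can emit logs in this shape, here is a sketch using only the Python standard library. The JsonFormatter class, the build_logger helper, and the whitelist of extra fields are assumptions, not part of any particular logging framework.

```python
import io
import json
import logging
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object per line."""

    EXTRA_FIELDS = ("traceId", "spanId", "userId", "orderId", "amount")

    def __init__(self, service):
        super().__init__()
        self.service = service

    def format(self, record):
        entry = {
            "timestamp": datetime.fromtimestamp(record.created, timezone.utc).isoformat(),
            "level": record.levelname,
            "service": self.service,
            "message": record.getMessage(),
        }
        # Values passed through `extra=` become attributes on the record.
        for name in self.EXTRA_FIELDS:
            if hasattr(record, name):
                entry[name] = getattr(record, name)
        return json.dumps(entry, ensure_ascii=False)


def build_logger(service, stream):
    logger = logging.getLogger(service)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter(service))
    logger.handlers = [handler]
    return logger


# Usage: write to an in-memory buffer so the output is easy to inspect.
buffer = io.StringIO()
log = build_logger("order-service", buffer)
log.info(
    "Order created successfully",
    extra={"traceId": "abc-123-def", "orderId": "ORD-20260510-001", "amount": 299.00},
)
print(buffer.getvalue())
```

Keeping business identifiers in separate fields, rather than inside the message string, is what makes queries such as "all logs for this orderId" cheap later on.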

11.3.2 ELK Stack Architecture

┌──────────────────────────────────────────────────────────────┐
│                    ELK Stack architecture                    │
├──────────────────────────────────────────────────────────────┤
│                                                              │
│  ┌──────────┐  ┌──────────┐  ┌──────────┐                    │
│  │ Service A│  │ Service B│  │ Service C│  emit logs         │
│  └────┬─────┘  └────┬─────┘  └────┬─────┘                    │
│       │             │             │                          │
│       └─────────────┼─────────────┘                          │
│                     ▼                                        │
│            ┌──────────────────┐                              │
│            │   Filebeat       │  log shipper                 │
│            └────────┬─────────┘                              │
│                     ▼                                        │
│            ┌──────────────────┐                              │
│            │   Logstash       │  parse / transform           │
│            │ (or direct to ES)│                              │
│            └────────┬─────────┘                              │
│                     ▼                                        │
│            ┌──────────────────┐                              │
│            │ Elasticsearch    │  store and search            │
│            └────────┬─────────┘                              │
│                     ▼                                        │
│            ┌──────────────────┐                              │
│            │    Kibana        │  visualize / query           │
│            └──────────────────┘                              │
└──────────────────────────────────────────────────────────────┘

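11.3.3 Loki: A Lightweight Logging Stack

  Loki (Grafana Labs) architecture:

  ┌──────────┐         ┌──────────┐         ┌──────────┐
  │ Promtail │────────▶│  Loki    │────────▶│ Grafana  │
  │ (collect)│         │ (store)  │         │ (query)  │
  └──────────┘         └──────────┘         └──────────┘

  Characteristics:
  • Indexes only labels, not log content, which keeps storage cost low
  • Uses the same label model as Prometheus
  • Queried with LogQL
  • Lightweight, with a small resource footprint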
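
The "index labels only" idea can be sketched in a few lines of Python. This toy store is an assumption-heavy illustration and says nothing about Loki's real storage format: label sets identify streams, and line content is only scanned after the label match has narrowed the candidates.

```python
from collections import defaultdict


class TinyLogStore:
    """Index label sets only; log lines are stored unindexed per stream."""

    def __init__(self):
        self.streams = defaultdict(list)   # frozenset of label pairs -> lines

    def push(self, labels, line):
        self.streams[frozenset(labels.items())].append(line)

    def query(self, selector, contains=""):
        wanted = set(selector.items())
        hits = []
        for stream_labels, lines in self.streams.items():
            if wanted <= stream_labels:                 # label match: indexed
                hits.extend(l for l in lines if contains in l)  # content: scanned
        return hits


store = TinyLogStore()
store.push({"service": "order-service", "level": "ERROR"}, "payment timeout after 3s")
store.push({"service": "order-service", "level": "INFO"}, "order created")
store.push({"service": "user-service", "level": "ERROR"}, "db timeout")

# Roughly analogous to the LogQL query {service="order-service"} |= "timeout"
print(store.query({"service": "order-service"}, contains="timeout"))
```

The trade-off shows up directly in the code: writes are cheap because nothing in the line is indexed, while a content filter costs a scan over every line in the matching streams, so narrow label selectors matter.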

11.3.4 Log Query Examples

  Kibana (KQL) examples:

  # Error logs from order-service
  service: "order-service" AND level: "ERROR"

  # Logs for a specific order
  orderId: "ORD-20260510-001"

  # Timeout errors in the last hour
  service: "payment-service" AND message: "timeout"
  AND @timestamp >= "now-1h"

  Grafana Loki (LogQL) examples:

  {service="order-service"} |= "error" | logfmt | duration > 1s
  {service="order-service", level="ERROR"} | json | userId="user-001"

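11.4 Distributed Tracing

11.4.1 Core Concepts

Concept	Description
Trace	The end-to-end call chain of one complete request
Span	One unit of work within a trace (for example, a single RPC call)
TraceID	A globally unique identifier carried through the whole request chain
SpanID	The unique identifier of a single span
ParentSpanID	The ID of the parent span; parent links form the call tree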

  Trace example: a user places an order

  TraceID: abc-123-def
  │
  ├── Span 1: API Gateway (200ms)
  │   └── Span 2: Order Service - createOrder (190ms)
  │       ├── Span 3: User Service - getUser (20ms)
  │       ├── Span 4: Product Service - getProduct (30ms)
  │       ├── Span 5: Inventory Service - deductStock (50ms)
  │       └── Span 6: Payment Service - pay (80ms)
  │           └── Span 7: Bank API - transfer (60ms)
  │
  Timeline (each character ≈ 5ms; calls are sequential, so a parent span
  lasts at least as long as the sum of its children):
               0ms                                200ms
  Gateway    ████████████████████████████████████████
  Order       ██████████████████████████████████████
  User         ████
  Product          ██████
  Inventory              ██████████
  Payment                          ████████████████
  Bank API                           ████████████
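
The way ParentSpanID turns a flat list of spans into a call tree can be sketched as follows. The span tuples use illustrative start and end offsets in milliseconds (assumptions, not measurements), and "self time" here assumes child calls do not overlap.

```python
from collections import defaultdict

# (span_id, parent_span_id, name, start_ms, end_ms): illustrative values
SPANS = [
    ("1", None, "API Gateway", 0, 200),
    ("2", "1", "Order Service - createOrder", 5, 195),
    ("3", "2", "User Service - getUser", 10, 30),
    ("4", "2", "Product Service - getProduct", 30, 60),
    ("5", "2", "Inventory Service - deductStock", 60, 110),
    ("6", "2", "Payment Service - pay", 110, 190),
    ("7", "6", "Bank API - transfer", 120, 180),
]


def build_tree(spans):
    """Group spans by parent ID; the root span has parent None."""
    children = defaultdict(list)
    for span in spans:
        children[span[1]].append(span)
    return children


def render(children, parent_id=None, depth=0):
    """Return indented lines with total duration and self time per span."""
    lines = []
    for span_id, _, name, start, end in sorted(children[parent_id], key=lambda s: s[3]):
        child_total = sum(c[4] - c[3] for c in children[span_id])
        self_ms = (end - start) - child_total   # time not spent waiting on children
        lines.append(f"{'  ' * depth}{name}: {end - start}ms (self {self_ms}ms)")
        lines.extend(render(children, span_id, depth + 1))
    return lines


for line in render(build_tree(SPANS)):
    print(line)
```

Self time is usually the number to look at when hunting a bottleneck: a span with a long total duration but a small self time is mostly waiting on something further down the tree.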

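11.4.2 OpenTelemetry

OpenTelemetry (OTel) is a CNCF project that defines a vendor-neutral standard for observability: one set of APIs, SDKs, and a Collector for gathering Metrics, Logs, and Traces.

  OpenTelemetry architecture: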

  ┌──────────────────────────────────────────────────────────┐
  │                 OpenTelemetry                             │
  │                                                          │
  │  Application layer (SDK)                                 │
  │  ┌──────────┐  ┌──────────┐  ┌──────────┐              │
  │  │ Traces   │  │ Metrics  │  │ Logs     │              │
  │  │ API/SDK  │  │ API/SDK  │  │ API/SDK  │              │
  │  └────┬─────┘  └────┬─────┘  └────┬─────┘              │
  │       └──────────────┼──────────────┘                    │
  │                      ▼                                    │
  │  ┌────────────────────────────────────────────────────┐  │
  │  │          OpenTelemetry Collector                   │  │
  │  │                                                    │  │
  │  │  Receivers ──▶ Processors ──▶ Exporters           │  │
  │  │  (OTLP/Jaeger)  (filter/batch)  (Prometheus/Jaeger)│  │
  │  └────────────────────────────────────────────────────┘  │
  │                      │                                    │
  │       ┌──────────────┼──────────────┐                    │
  │       ▼              ▼              ▼                    │
  │  ┌──────────┐  ┌──────────┐  ┌──────────┐              │
  │  │ Jaeger   │  │Prometheus│  │  Loki    │              │
  │  │ (Traces) │  │(Metrics) │  │ (Logs)   │              │
  │  └──────────┘  └──────────┘  └──────────┘              │
  │                      │                                    │
  │                      ▼                                    │
  │               ┌──────────┐                               │
  │               │ Grafana  │  unified dashboards           │
  │               └──────────┘                               │
  └──────────────────────────────────────────────────────────┘

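11.4.3 Integrating OpenTelemetry with Spring Boot

<!-- pom.xml -->
<!-- No <version> is given here; this assumes the version is managed elsewhere,
     for example by importing the OpenTelemetry instrumentation BOM -->
<dependency>
    <groupId>io.opentelemetry.instrumentation</groupId>
    <artifactId>opentelemetry-spring-boot-starter</artifactId>
</dependency>

# application.yml
otel:
  service:
    name: order-service
  exporter:
    otlp:
      # 4317 is the OTLP gRPC port; OTLP over HTTP normally uses 4318
      endpoint: http://otel-collector:4317
  traces:
    sampler:
      # Sample 10% of traces. Property names vary by starter version
      # (some use sampler + sampler.arg); check the version in use.
      probability: 0.1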
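
A 10% sampling rate is normally decided from the trace ID itself, so every service that sees the same trace reaches the same decision without coordination. The sketch below shows that idea in simplified form; it is not the OpenTelemetry SDK's exact algorithm, and the function names are assumptions.

```python
import secrets


def new_trace_id():
    """Generate a random 128-bit trace ID as 32 hex characters."""
    return secrets.token_hex(16)


def should_sample(trace_id_hex, ratio):
    """Keep a trace when the low 64 bits of its ID fall under the ratio.

    The decision depends only on the trace ID, so it is deterministic and
    repeatable across services.
    """
    low64 = int(trace_id_hex[-16:], 16)
    return low64 < int(ratio * 2**64)


# Roughly 10% of random trace IDs should be kept.
kept = sum(should_sample(new_trace_id(), 0.1) for _ in range(10_000))
print(f"kept {kept} of 10000 traces")
```

Because sampling happens per trace rather than per span, a sampled request keeps its full call chain; the cost is that rare failures may fall into the discarded 90%, which is one reason some teams add tail-based sampling in the Collector.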

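11.4.4 Tracing Tool Comparison

Dimension	Jaeger	Zipkin	SkyWalking	Tempo
Originated at	Uber	Twitter	Apache	Grafana Labs
Storage	ES/Cassandra (Kafka as a buffer)	ES/Cassandra/MySQL	ES/H2/MySQL	Object storage
Language support	Multi-language	Multi-language	Java-centric (+ agents)	Multi-language
Service topology	✅	✅	✅	✅
Sampling strategies	Several	Several	Several	Several
Grafana integration	✅	✅	⚠️	✅ native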

11.5 A Unified Grafana Dashboard

  Example Grafana dashboard:

  ┌──────────────────────────────────────────────────────────────┐
  │  Microservices dashboard: production                         │
  ├──────────────────────────────────────────────────────────────┤
  │                                                              │
  │  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐        │
  │  │ Total QPS    │  │ Error rate   │  │ P99 latency  │        │
  │  │  12,543      │  │  0.12%       │  │  89ms        │        │
  │  │  ▲ +5%       │  │  ▼ -0.02%    │  │  ▼ -10ms     │        │
  │  └──────────────┘  └──────────────┘  └──────────────┘        │
  │                                                              │
  │  ┌─────────────────────────────────────────────────────┐     │
  │  │ QPS per service over time (line chart)              │     │
  │  │  ▲                                                  │     │
  │  │  │    ╱╲   ╱╲                                       │     │
  │  │  │   ╱  ╲ ╱  ╲                                      │     │
  │  │  │  ╱    ╳    ╲                                     │     │
  │  │  │ ╱   ╱ ╲     ╲                                    │     │
  │  │  └──────────────────▶                               │     │
  │  └─────────────────────────────────────────────────────┘     │
  │                                                              │
  │  ┌─────────────────────┐  ┌─────────────────────┐            │
  │  │ Service topology    │  │ Recent error logs   │            │
  │  │  (Tempo)            │  │  (Loki)             │            │
  │  │  GW→OS→PS→IS→PS     │  │  ERROR timeout...   │            │
  │  └─────────────────────┘  └─────────────────────┘            │
  └──────────────────────────────────────────────────────────────┘

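11.6 Alerting

11.6.1 Alert Severity Levels

Level	Trigger	Notification	Response time
P0 (fatal)	Service unavailable, data loss	Phone call + SMS + DingTalk	5 minutes
P1 (severe)	Error rate > 5%, latency > 1s	SMS + DingTalk	15 minutes
P2 (warning)	Error rate > 1%, resource usage > 80%	DingTalk/Slack	1 hour
P3 (notice)	Non-urgent changes	Email	Next business day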
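
One way to keep such a severity table enforceable is to encode it as data and classify each measurement against it. The sketch below is hypothetical: the ROUTES structure, the channel names, and the classify function are assumptions, with thresholds copied from the table above. P3 is left to manual notifications because it is not driven by a metric.

```python
ROUTES = {
    "P0": {"channels": ["phone", "sms", "dingtalk"], "respond_within_min": 5},
    "P1": {"channels": ["sms", "dingtalk"], "respond_within_min": 15},
    "P2": {"channels": ["dingtalk_or_slack"], "respond_within_min": 60},
    "P3": {"channels": ["email"], "respond_within_min": None},  # next business day
}


def classify(service_up, error_ratio, p99_seconds, resource_usage):
    """Map current measurements to a severity level, or None if healthy.

    Checks run from most to least severe so the highest matching level wins.
    """
    if not service_up:
        return "P0"
    if error_ratio > 0.05 or p99_seconds > 1.0:
        return "P1"
    if error_ratio > 0.01 or resource_usage > 0.80:
        return "P2"
    return None


level = classify(service_up=True, error_ratio=0.02, p99_seconds=0.3, resource_usage=0.55)
print(level, ROUTES[level]["channels"] if level else "no alert")
```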

11.6.2 Golden Signals (Google SRE)

Signal	Description	PromQL
Latency	Request latency	`histogram_quantile(0.99, ...)`
Traffic	Request rate	`rate(http_requests_total[5m])`
Errors	Rate of failed requests (5xx per second)	`rate(http_requests_total{status=~"5.."}[5m])`
Saturation	How full the resources are	`CPU/Memory/Disk usage`

11.7 Business Scenario: Observability for an E-commerce System

  ┌──────────────────────────────────────────────────────────────┐
  │     Observability architecture for the e-commerce system     │
  ├──────────────────────────────────────────────────────────────┤
  │                                                              │
  │  Collection layer                                            │
  │  ┌───────────────┐ ┌───────────────┐ ┌───────────────┐       │
  │  │ OpenTelemetry │ │ Promtail      │ │ Prometheus    │       │
  │  │ (Traces)      │ │ (Logs)        │ │ (Metrics)     │       │
  │  └───────┬───────┘ └───────┬───────┘ └───────┬───────┘       │
  │          └─────────────────┼─────────────────┘               │
  │                            ▼                                 │
  │  Storage layer                                               │
  │  ┌───────────────┐ ┌───────────────┐ ┌───────────────┐       │
  │  │ Tempo         │ │ Loki          │ │ Prometheus    │       │
  │  │ (Traces)      │ │ (Logs)        │ │ (Metrics)     │       │
  │  └───────┬───────┘ └───────┬───────┘ └───────┬───────┘       │
  │          └─────────────────┼─────────────────┘               │
  │                            ▼                                 │
  │  Visualization layer                                         │
  │  ┌──────────────────────────────────────────────────────┐    │
  │  │                       Grafana                        │    │
  │  │  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐│    │
  │  │  │ Business     │  │ Technical    │  │ Alerts       ││    │
  │  │  │ Orders       │  │ QPS / latency│  │ Anomalies    ││    │
  │  │  │ Revenue      │  │ Error rate   │  │ Alert history││    │
  │  │  └──────────────┘  └──────────────┘  └──────────────┘│    │
  │  └──────────────────────────────────────────────────────┘    │
  │                            │                                 │
  │                            ▼                                 │
  │  Notification layer                                          │
  │  ┌───────────────┐ ┌───────────────┐ ┌───────────────┐       │
  │  │ Phone call    │ │ DingTalk/Slack│ │ Email         │       │
  │  └───────────────┘ └───────────────┘ └───────────────┘       │
  └──────────────────────────────────────────────────────────────┘

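⚠️ Notes

Trace sampling: do not collect 100% of traces in production; 10-20% is usually enough
Log volume: do not emit DEBUG-level logs in production
Sensitive data: keep passwords, tokens, ID numbers, and similar data out of logs, or mask them
Alert volume: do not create too many alerts; alert fatigue is more dangerous than having no alerts
Correlate by TraceID: every log entry must include the TraceID so logs and traces can be queried together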
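
Masking sensitive values is easiest to enforce at the point where log messages are built. The sketch below shows one possible approach with regular expressions; the patterns, the redact function, and the sample messages are assumptions, and real systems usually need a broader, reviewed pattern list.

```python
import re

PATTERNS = [
    # password=... or "token": "..." in key=value or JSON style
    (re.compile(r'(?i)("?(?:password|token)"?\s*[:=]\s*"?)[^"\s,}]+'), r"\1***"),
    # 18-character resident ID number (17 digits plus a digit or X)
    (re.compile(r"\b\d{6}(?:19|20)\d{9}[\dXx]\b"), "***ID***"),
]


def redact(message):
    """Replace sensitive values in a log message before it is written."""
    for pattern, replacement in PATTERNS:
        message = pattern.sub(replacement, message)
    return message


print(redact("login ok password=hunter2 user=u1"))
print(redact('{"token": "abc.def", "x": 1}'))
print(redact("verified id 110101199003071234 ok"))
```

Pattern-based masking is a safety net rather than a guarantee; the more reliable habit is to never pass secrets to the logger in the first place.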

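📖 Further Reading

OpenTelemetry Documentation (opentelemetry.io): the observability standard
Prometheus Documentation (prometheus.io): the de facto standard for metrics monitoring
Grafana Documentation (grafana.com): visualization best practices
Google SRE Book: defines the Golden Signals
Observability Engineering, Charity Majors et al. (O'Reilly): observability engineering in practice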

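Chapter Summary

Pillar	Tools	Core purpose
Metrics	Prometheus + Grafana	Trend monitoring, alerting
Logs	ELK / Loki	Troubleshooting, auditing
Traces	Jaeger / Tempo	Call-chain analysis, bottleneck location
Unified standard	OpenTelemetry	One collection path for all three signals

📌 Next chapter: Chapter 12, Testing Strategy (contract testing, integration testing, chaos engineering).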