
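Monitoring and Management

Spring Boot Actuator provides production-ready monitoring and management features, including health checks, metrics collection, auditing, and HTTP exchange recording. This chapter explains how to use and extend Actuator.

Actuator Overview

What Is Actuator?

Actuator is Spring Boot's production-ready feature module. It provides:

  • Health checks: monitor the health of the application
  • Metrics collection: performance and business metrics
  • Endpoint exposure: access over HTTP or JMX
  • Auditing: record significant events
  • Remote management: inspect and adjust a running application (for example, its log levels) through endpoints

Adding the Dependency

<dependency>
    <groupId>org.springframework.boot</groupId>
    <artifactId>spring-boot-starter-actuator</artifactId>
</dependency>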

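Built-in Endpoints

Since Spring Boot 3.0, only the health endpoint is exposed by default, over both HTTP and JMX. Every other endpoint must be exposed explicitly. (In Spring Boot 2.x most endpoints were exposed over JMX by default.)

| Endpoint | Description | Exposed by default |
| --- | --- | --- |
| health | Application health status | HTTP, JMX |
| info | Application information | No |
| beans | List of Spring beans | No |
| conditions | Auto-configuration condition report | No |
| configprops | Configuration properties | No |
| env | Environment properties | No |
| loggers | Logging configuration | No |
| metrics | Metrics | No |
| mappings | URL mappings | No |
| shutdown | Graceful application shutdown | No (also disabled by default) |
| threaddump | Thread dump | No |
| heapdump | Heap dump | No (HTTP only when exposed) |
| caches | Cache information | No |
| scheduledtasks | Scheduled tasks | No |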

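Endpoint Configuration

Exposing Endpoints

management:
  endpoints:
    web:
      exposure:
        # Expose all endpoints
        include: "*"
        # Exclude specific endpoints (exclude takes precedence over include)
        exclude: shutdown,heapdump

    # JMX exposure
    jmx:
      exposure:
        include: "*"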

Recommended Practice

management:
  endpoints:
    web:
      exposure:
        # In production, expose only the endpoints you need
        include: health,info,metrics,prometheus

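Endpoint Access Control (Spring Boot 3.4+)

Spring Boot 3.4 introduced a finer-grained endpoint access model that adds a read-only access level:

management:
  endpoints:
    access:
      default: read-only       # default access level for all endpoints

  endpoint:
    health:
      access: unrestricted     # full access
    loggers:
      access: read-only        # read operations only
    shutdown:
      access: none             # no access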

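Access Levels

| Access level | Description |
| --- | --- |
| none | The endpoint cannot be accessed |
| read-only | Only read operations are allowed; write and delete operations are rejected |
| unrestricted | Full access: read, write, and delete operations are allowed |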

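Capping the Maximum Access Level

management:
  endpoints:
    access:
      max-permitted: read-only   # no endpoint may exceed read-only access

For example, even if the loggers endpoint is configured as unrestricted, setting max-permitted to read-only means log levels can be read but not changed.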
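The capping rule can be pictured as taking the lower of the configured level and the cap. The following plain-Java sketch is only a model of that idea, not Spring Boot's implementation; `AccessLevel`, `cappedBy`, and `EndpointAccessSketch` are hypothetical names introduced for illustration.

```java
import java.util.Map;

/** Hypothetical model of the three access levels, ordered from least to most permissive. */
enum AccessLevel {
    NONE, READ_ONLY, UNRESTRICTED;

    /** Returns this level, lowered to the cap when it would exceed it. */
    AccessLevel cappedBy(AccessLevel maxPermitted) {
        return this.ordinal() <= maxPermitted.ordinal() ? this : maxPermitted;
    }
}

final class EndpointAccessSketch {

    /** Resolves the effective level: per-endpoint setting, else the default, then capped. */
    static AccessLevel effective(Map<String, AccessLevel> perEndpoint,
                                 AccessLevel defaultLevel,
                                 AccessLevel maxPermitted,
                                 String endpointId) {
        AccessLevel configured = perEndpoint.getOrDefault(endpointId, defaultLevel);
        return configured.cappedBy(maxPermitted);
    }

    public static void main(String[] args) {
        Map<String, AccessLevel> perEndpoint = Map.of(
                "loggers", AccessLevel.UNRESTRICTED,
                "shutdown", AccessLevel.NONE);
        // loggers asks for UNRESTRICTED, but the cap lowers it to READ_ONLY
        System.out.println(effective(perEndpoint, AccessLevel.READ_ONLY,
                AccessLevel.READ_ONLY, "loggers"));
    }
}
```

Note that a cap only ever lowers access: an endpoint set to none stays at none regardless of the cap.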

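Compatibility with the Old Configuration

Use one style or the other, not both in the same file.

# Old configuration (deprecated but still supported)
management:
  endpoints:
    enabled-by-default: true
  endpoint:
    health:
      enabled: true

# New configuration (recommended)
management:
  endpoints:
    access:
      default: read-only
  endpoint:
    health:
      access: unrestricted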

Endpoint Security

management:
  endpoints:
    web:
      base-path: /actuator   # default base path
      exposure:
        include: health,info

  endpoint:
    health:
      show-details: when-authorized   # show details to authorized users only
      # show-details: always          # always show details
      # show-details: never           # never show details

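Using Spring Security

import org.springframework.boot.actuate.autoconfigure.security.servlet.EndpointRequest;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.security.config.Customizer;
import org.springframework.security.config.annotation.web.builders.HttpSecurity;
import org.springframework.security.web.SecurityFilterChain;

@Configuration
public class ActuatorSecurityConfig {

    @Bean
    public SecurityFilterChain actuatorSecurity(HttpSecurity http) throws Exception {
        http
            // Servlet stack, Spring Security 6: apply this chain to actuator endpoints only
            .securityMatcher(EndpointRequest.toAnyEndpoint())
            .authorizeHttpRequests(auth -> auth
                .requestMatchers(EndpointRequest.to("health", "info")).permitAll()
                .anyRequest().hasRole("ACTUATOR")
            )
            .httpBasic(Customizer.withDefaults());
        return http.build();
    }
}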

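Customizing the Endpoint Path and Port

management:
  endpoints:
    web:
      base-path: /management   # change the base path

  server:
    port: 8081           # serve management endpoints on a separate port
    address: 127.0.0.1   # bind to the loopback interface so that only local clients can connect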

Health Checks

Basic Usage

Request GET /actuator/health

{
"status": "UP",
"components": {
"db": {
"status": "UP",
"details": {
"database": "MySQL",
"validationQuery": "isValid()"
}
},
"diskSpace": {
"status": "UP",
"details": {
"total": 107374182400,
"free": 53687091200,
"threshold": 10485760,
"exists": true
}
},
"ping": {
"status": "UP"
},
"redis": {
"status": "UP",
"details": {
"version": "7.0.0"
}
}
}
}

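Health Statuses

| Status | Description |
| --- | --- |
| UP | The component is running normally |
| DOWN | The component is unavailable |
| OUT_OF_SERVICE | The component has been taken out of service |
| UNKNOWN | The status cannot be determined |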
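The overall status is derived from the component statuses. With the default ordering (DOWN, OUT_OF_SERVICE, UP, UNKNOWN) the most severe status wins, and by default DOWN and OUT_OF_SERVICE are served with HTTP 503 while the others map to 200. The sketch below models that rule in plain Java; it is an illustration, not Spring Boot's aggregator, and `HealthStatus` and `HealthAggregationSketch` are hypothetical names.

```java
import java.util.Collection;
import java.util.List;

/** Hypothetical status enum, declared in the default severity order (most severe first). */
enum HealthStatus { DOWN, OUT_OF_SERVICE, UP, UNKNOWN }

final class HealthAggregationSketch {

    /** Returns the most severe status among the components, or UNKNOWN when there are none. */
    static HealthStatus aggregate(Collection<HealthStatus> components) {
        HealthStatus worst = HealthStatus.UNKNOWN;
        for (HealthStatus status : components) {
            if (status.ordinal() < worst.ordinal()) {
                worst = status;
            }
        }
        return worst;
    }

    /** Default HTTP mapping: DOWN and OUT_OF_SERVICE give 503, everything else 200. */
    static int httpStatus(HealthStatus status) {
        return (status == HealthStatus.DOWN || status == HealthStatus.OUT_OF_SERVICE) ? 503 : 200;
    }

    public static void main(String[] args) {
        HealthStatus overall = aggregate(List.of(HealthStatus.UP, HealthStatus.DOWN, HealthStatus.UP));
        System.out.println(overall + " -> HTTP " + httpStatus(overall));
    }
}
```

One failing component is therefore enough to turn the whole endpoint to DOWN, which is why health groups are useful for separating critical from non-critical checks.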

Custom Health Indicator

@Component
public class CustomHealthIndicator implements HealthIndicator {

    // ExternalService is an application-specific client, shown here for illustration
    @Autowired
    private ExternalService externalService;

    @Override
    public Health health() {
        try {
            // Check the external service
            if (externalService.isAvailable()) {
                return Health.up()
                        .withDetail("service", "External Service")
                        .withDetail("responseTime", "100ms")
                        .build();
            } else {
                return Health.down()
                        .withDetail("service", "External Service")
                        .withDetail("error", "Service unavailable")
                        .build();
            }
        } catch (Exception e) {
            // Report the exception as part of the DOWN status
            return Health.down(e)
                    .withDetail("service", "External Service")
                    .build();
        }
    }
}

Database Health Indicator

@Component
public class DatabaseHealthIndicator implements HealthIndicator {

    @Autowired
    private DataSource dataSource;

    @Override
    public Health health() {
        // Borrow a connection and validate it with a 1-second timeout
        try (Connection conn = dataSource.getConnection()) {
            if (conn.isValid(1)) {
                return Health.up()
                        .withDetail("database", "MySQL")
                        .withDetail("validationQuery", "isValid()")
                        .build();
            }
            return Health.down().withDetail("error", "Connection invalid").build();
        } catch (SQLException e) {
            return Health.down(e).build();
        }
    }
}

Health Check Configuration

management:
  endpoint:
    health:
      show-details: always
      group:
        # Custom health groups
        liveness:
          include: ping,diskSpace
        readiness:
          include: db,redis
      probes:
        enabled: true   # enable Kubernetes probes

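SSL Health Check (Spring Boot 3.4+)

Spring Boot 3.4 added an SSL certificate health check that monitors certificate validity:

management:
  health:
    ssl:
      enabled: true                                 # enable the SSL health check
      certificate-validity-warning-threshold: 14d   # warn when a certificate expires within this period

Example Health Response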

{
"status": "UP",
"components": {
"ssl": {
"status": "UP",
"details": {
"validChains": 2,
"invalidChains": 0,
"expiringSoonChains": 0
}
}
}
}

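Response When a Certificate Is About to Expire (illustrative; the exact status and detail fields depend on the Spring Boot version)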

{
"status": "OUT_OF_SERVICE",
"components": {
"ssl": {
"status": "OUT_OF_SERVICE",
"details": {
"validChains": 1,
"invalidChains": 1,
"expiringSoonChains": 1,
"expiredCertificates": [
{
"alias": "server-cert",
"expires": "2024-12-15T00:00:00Z"
}
]
}
}
}
}

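SSL Information Endpoint (Spring Boot 3.4+)

SSL information can be shown in the /actuator/info endpoint. The ssl contributor requires a configured SSL bundle and is disabled by default, so it must be enabled explicitly:

# Enable SSL information (disabled by default)
management:
  info:
    ssl:
      enabled: true

Request /actuator/info to view the SSL information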

{
"ssl": {
"bundles": {
"server": {
"certificateChain": [
{
"subject": "CN=example.com, O=Example Inc",
"issuer": "CN=Let's Encrypt Authority X3",
"notBefore": "2024-01-01T00:00:00Z",
"notAfter": "2024-12-31T23:59:59Z",
"daysUntilExpiry": 180
}
]
}
}
}
}

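Tip: the info endpoint reports each certificate's validity, so operators can spot certificates that are about to expire and renew them in time.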

Kubernetes Probes

Spring Boot 2.3+ supports Kubernetes probes:

management:
  endpoint:
    health:
      probes:
        enabled: true
  health:
    livenessstate:
      enabled: true
    readinessstate:
      enabled: true

Endpoints

  • /actuator/health/liveness - liveness probe
  • /actuator/health/readiness - readiness probe

Kubernetes Configuration

livenessProbe:
  httpGet:
    path: /actuator/health/liveness
    port: 8080
  initialDelaySeconds: 10
  periodSeconds: 10

readinessProbe:
  httpGet:
    path: /actuator/health/readiness
    port: 8080
  initialDelaySeconds: 5
  periodSeconds: 5

Metrics

Built-in Metrics

Request GET /actuator/metrics

{
"names": [
"jvm.memory.max",
"jvm.memory.used",
"jvm.gc.pause",
"process.cpu.usage",
"system.cpu.usage",
"http.server.requests",
"tomcat.threads.busy"
]
}

Viewing a Single Metric

GET /actuator/metrics/jvm.memory.used

{
"name": "jvm.memory.used",
"description": "The amount of used memory",
"baseUnit": "bytes",
"measurements": [
{
"statistic": "VALUE",
"value": 123456789
}
],
"availableTags": [
{
"tag": "area",
"values": ["heap", "nonheap"]
},
{
"tag": "id",
"values": ["G1 Survivor Space", "G1 Old Gen", "G1 Eden Space"]
}
]
}

Filtering by Tag

GET /actuator/metrics/jvm.memory.used?tag=area:heap

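Custom Metrics

import io.micrometer.core.instrument.Counter;
import io.micrometer.core.instrument.MeterRegistry;
import io.micrometer.core.instrument.Timer;
import org.springframework.stereotype.Service;

// No Lombok @RequiredArgsConstructor here: it would generate a second constructor
// that conflicts with the explicit one below.
@Service
public class OrderService {

    private final MeterRegistry meterRegistry;
    private final Counter orderCounter;
    private final Timer orderTimer;

    public OrderService(MeterRegistry meterRegistry) {
        this.meterRegistry = meterRegistry;

        // Counter of created orders
        this.orderCounter = Counter.builder("orders.created")
                .description("Total orders created")
                .tag("type", "online")
                .register(meterRegistry);

        // Timer for order processing
        this.orderTimer = Timer.builder("orders.processing.time")
                .description("Order processing time")
                .register(meterRegistry);
    }

    public Order createOrder(OrderDTO dto) {
        // record(...) times the supplier and returns its result
        return orderTimer.record(() -> {
            // Process the order (processOrder is application-specific and not shown)
            Order order = processOrder(dto);

            // Increment the counter
            orderCounter.increment();

            return order;
        });
    }
}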

Metric Types

| Type | Description | Typical use |
| --- | --- | --- |
| Counter | A value that only increases | Request count, error count |
| Gauge | A value that can go up and down | Active connections, queue size |
| Timer | Duration statistics | Request latency |
| DistributionSummary | Distribution statistics | Request size distribution |

Example

@Service
public class MetricsService {

    private final MeterRegistry registry;

    // Counter
    private final Counter requestCounter;

    // Gauge: the current value is read from this AtomicInteger
    private final AtomicInteger activeConnections = new AtomicInteger(0);

    // Timer
    private final Timer requestTimer;

    // DistributionSummary
    private final DistributionSummary requestSize;

    public MetricsService(MeterRegistry registry) {
        this.registry = registry;

        // Create the Counter
        this.requestCounter = Counter.builder("app.requests")
                .description("Total requests")
                .tag("endpoint", "/api/orders")
                .register(registry);

        // Create the Gauge
        registry.gauge("app.connections.active", activeConnections);

        // Create the Timer
        this.requestTimer = Timer.builder("app.request.duration")
                .description("Request duration")
                .publishPercentiles(0.5, 0.95, 0.99)
                .register(registry);

        // Create the DistributionSummary
        this.requestSize = DistributionSummary.builder("app.request.size")
                .description("Request size in bytes")
                .baseUnit("bytes")
                .register(registry);
    }

    public void incrementRequest() {
        requestCounter.increment();
    }

    public void recordRequestDuration(long millis) {
        requestTimer.record(millis, TimeUnit.MILLISECONDS);
    }

    public void recordRequestSize(long bytes) {
        requestSize.record(bytes);
    }

    public void connectionAdded() {
        activeConnections.incrementAndGet();
    }

    public void connectionRemoved() {
        activeConnections.decrementAndGet();
    }
}

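HTTP Request Metrics

HTTP request metrics (http.server.requests) are collected automatically. The management.metrics.web.server.request.autotime.* properties from Spring Boot 2.x were removed in Spring Boot 3; percentiles are now configured per meter name:

management:
  metrics:
    distribution:
      percentiles:
        http.server.requests: 0.5,0.95,0.99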

Prometheus Integration

Adding the Dependency

<dependency>
    <groupId>io.micrometer</groupId>
    <artifactId>micrometer-registry-prometheus</artifactId>
</dependency>

Exposing the Endpoint

management:
  endpoints:
    web:
      exposure:
        include: health,info,metrics,prometheus

  metrics:
    tags:
      application: ${spring.application.name}   # add an application tag to every meter

Requesting the Prometheus Endpoint

GET /actuator/prometheus

# HELP jvm_memory_used_bytes The amount of used memory
# TYPE jvm_memory_used_bytes gauge
jvm_memory_used_bytes{area="heap",id="G1 Old Gen",} 1.23456789E8
jvm_memory_used_bytes{area="heap",id="G1 Eden Space",} 5.67890123E7
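The sample output shows the Micrometer meter jvm.memory.used appearing as jvm_memory_used_bytes. The sketch below approximates that renaming in plain Java, assuming only two rules: characters that are not legal in a Prometheus metric name become underscores, and the base unit is appended as a suffix. The real naming convention has more rules (for example counter and timer suffixes), and `PrometheusNameSketch` is a hypothetical helper, not part of Micrometer.

```java
final class PrometheusNameSketch {

    /**
     * Simplified conversion of a dotted meter name to a Prometheus-style name.
     * Assumption: only character replacement and unit suffixing are modeled.
     */
    static String toPrometheusName(String meterName, String baseUnit) {
        // Prometheus metric names may contain letters, digits, underscores, and colons
        String name = meterName.replaceAll("[^a-zA-Z0-9_:]", "_");
        if (baseUnit != null && !baseUnit.isBlank() && !name.endsWith("_" + baseUnit)) {
            name = name + "_" + baseUnit;
        }
        return name;
    }

    public static void main(String[] args) {
        System.out.println(toPrometheusName("jvm.memory.used", "bytes"));
        System.out.println(toPrometheusName("http.server.requests", "seconds"));
    }
}
```

Knowing the rule helps when writing PromQL: a query must use the underscore form, such as http_server_requests_seconds_count, not the dotted Micrometer name.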

Prometheus Configuration

# prometheus.yml
scrape_configs:
  - job_name: 'spring-boot'
    metrics_path: '/actuator/prometheus'
    static_configs:
      - targets: ['localhost:8080']

Grafana Visualization

Use Grafana to display the metrics:

  1. Add a Prometheus data source
  2. Import a Spring Boot dashboard (ID: 12900)
  3. Build custom monitoring panels

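Application Information

Configuring Application Information

# The @...@ placeholders are expanded by Maven resource filtering and must be quoted in YAML
management:
  info:
    env:
      enabled: true   # info.* properties are published only when the env contributor is enabled

info:
  app:
    name: "@project.name@"
    version: "@project.version@"
    description: "@project.description@"
    java:
      version: "@java.version@"
  author: 张三
  contact: [email protected]

Request GET /actuator/info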

{
"app": {
"name": "myapp",
"version": "1.0.0",
"description": "My Application",
"java": {
"version": "17"
}
},
"author": "张三",
"contact": "[email protected]"
}

Git Information

Add the git-commit-id-maven-plugin

<plugin>
    <groupId>io.github.git-commit-id</groupId>
    <artifactId>git-commit-id-maven-plugin</artifactId>
    <version>6.0.0</version>
    <executions>
        <execution>
            <goals>
                <goal>revision</goal>
            </goals>
        </execution>
    </executions>
</plugin>

Enable full Git information:

management:
  info:
    git:
      mode: full

Custom Endpoints

Creating a Custom Endpoint

@Component
@Endpoint(id = "custom")
public class CustomEndpoint {

    // GET /actuator/custom
    @ReadOperation
    public Map<String, Object> info() {
        Map<String, Object> info = new HashMap<>();
        info.put("timestamp", System.currentTimeMillis());
        info.put("status", "running");
        return info;
    }

    // GET /actuator/custom/{name}
    @ReadOperation
    public Map<String, Object> detail(@Selector String name) {
        Map<String, Object> detail = new HashMap<>();
        detail.put("name", name);
        detail.put("value", "detail value");
        return detail;
    }

    // POST /actuator/custom/{name}
    @WriteOperation
    public void update(@Selector String name, @Nullable String value) {
        // Update operation
    }

    // DELETE /actuator/custom/{name}
    @DeleteOperation
    public void delete(@Selector String name) {
        // Delete operation
    }
}

Access:

  • GET /actuator/custom - calls info()
  • GET /actuator/custom/myname - calls detail("myname")

Web-Only Endpoint

@Component
@WebEndpoint(id = "customweb")
public class CustomWebEndpoint {

    @ReadOperation
    public WebEndpointResponse<Map<String, Object>> info() {
        Map<String, Object> data = new HashMap<>();
        data.put("message", "Hello from custom endpoint");
        // WebEndpointResponse lets the operation set the HTTP status code
        return new WebEndpointResponse<>(data, HttpStatus.OK.value());
    }
}

Controller Endpoint

Note: @ControllerEndpoint has been deprecated since Spring Boot 3.3; prefer @Endpoint or @WebEndpoint for new code.

@Component
@ControllerEndpoint(id = "customcontroller")
public class CustomControllerEndpoint {

    @GetMapping("/hello")
    @ResponseBody
    public String hello(@RequestParam String name) {
        return "Hello, " + name;
    }
}

Access: GET /actuator/customcontroller/hello?name=World

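Auditing

Configuring Auditing

Auditing is active once an AuditEventRepository bean is present; the property below is the on/off switch.

management:
  auditevents:
    enabled: true

Custom Audit Events

@Configuration
public class AuditConfig {

    // In-memory storage is suitable for development only
    @Bean
    public AuditEventRepository auditEventRepository() {
        return new InMemoryAuditEventRepository();
    }
}

@Service
@RequiredArgsConstructor
public class UserService {

    private final AuditEventRepository auditRepository;

    public void login(String username, boolean success) {
        // The varargs data entries use the "key=value" form
        auditRepository.add(new AuditEvent(
                username,
                "AUTHENTICATION",
                "result=" + (success ? "SUCCESS" : "FAILURE")
        ));
    }
}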

Best Practices

1. Security Configuration

management:
  endpoints:
    web:
      exposure:
        include: health,info,metrics,prometheus
  endpoint:
    health:
      show-details: when-authorized

2. Separate Port

management:
  server:
    port: 8081
    address: 127.0.0.1

3. Metric Tags

management:
  metrics:
    tags:
      application: ${spring.application.name}
      environment: ${spring.profiles.active:default}   # falls back to "default" when no profile is active

4. Monitoring and Alerting

Combine Prometheus alerting rules with Alertmanager:

groups:
  - name: spring-boot
    rules:
      - alert: HighErrorRate
        # Fires when a series sees more than 0.1 5xx responses per second over 5 minutes
        expr: rate(http_server_requests_seconds_count{status=~"5.."}[5m]) > 0.1
        for: 5m
        annotations:
          summary: "High error rate detected"

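Observability

Observability is the ability to understand the internal state of a running system from the outside. It rests on three pillars: logging, metrics, and traces. Spring Boot provides observability support through Micrometer.

The Three Pillars of Observability

| Pillar | Description | Questions it answers |
| --- | --- | --- |
| Logs | Records of discrete events | What happened, and when? |
| Metrics | Aggregated numeric measurements | How is the system doing? What is the trend? |
| Traces | The full path of a request | Which services did the request pass through? Where was the time spent? |

Micrometer Observation API

Spring Boot uses the Micrometer Observation API to handle metrics and traces through a single abstraction.

Creating a Custom Observation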

import io.micrometer.observation.Observation;
import io.micrometer.observation.ObservationRegistry;
import org.springframework.stereotype.Component;

@Component
public class OrderService {

    private final ObservationRegistry observationRegistry;

    public OrderService(ObservationRegistry observationRegistry) {
        this.observationRegistry = observationRegistry;
    }

    public Order createOrder(OrderDTO dto) {
        // Create an observation; registered handlers turn it into metrics and traces
        return Observation.createNotStarted("order.create", observationRegistry)
                .lowCardinalityKeyValue("type", dto.getType())        // low cardinality: added to metrics and traces
                .highCardinalityKeyValue("userId", dto.getUserId())   // high cardinality: added to traces only
                .observe(() -> {
                    // Actual business logic (doCreateOrder is application-specific)
                    return doCreateOrder(dto);
                });
    }
}

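Tag Cardinality

  • Low-cardinality tags: a bounded set of values, such as type, status, or method. They are added to both metrics and traces.
  • High-cardinality tags: an unbounded set of values, such as userId or orderId. They are added to traces only, which avoids an explosion of metric time series.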
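To see why high-cardinality tags are kept out of metrics, note that each distinct combination of tag values becomes its own time series, so the worst case is the product of the number of distinct values per tag. The sketch below computes that upper bound in plain Java; `CardinalitySketch` and the value counts are hypothetical and only illustrate the arithmetic.

```java
import java.util.Map;

final class CardinalitySketch {

    /** Upper bound on time series for one meter: the product of distinct values per tag. */
    static long maxSeries(Map<String, Integer> distinctValuesPerTag) {
        long series = 1;
        for (int count : distinctValuesPerTag.values()) {
            // multiplyExact throws ArithmeticException on overflow instead of wrapping
            series = Math.multiplyExact(series, (long) count);
        }
        return series;
    }

    public static void main(String[] args) {
        // Three bounded tags stay manageable
        System.out.println(maxSeries(Map.of("type", 3, "status", 2, "method", 4)));
        // Adding a userId tag with 100,000 users multiplies the bound accordingly
        System.out.println(maxSeries(Map.of("type", 3, "status", 2, "method", 4, "userId", 100_000)));
    }
}
```

With the assumed counts, the bound grows from 24 series to 2,400,000 as soon as userId is attached to the metric.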

Observation Lifecycle

// Option 1: use observe() (recommended)
Observation.createNotStarted("my.operation", observationRegistry)
        .lowCardinalityKeyValue("key", "value")
        .observe(() -> {
            // Business logic
        });

// Option 2: control the lifecycle manually
Observation observation = Observation.createNotStarted("my.operation", observationRegistry)
        .lowCardinalityKeyValue("key", "value")
        .start();

try {
    // Business logic
    observation.event(Observation.Event.of("step1", "Step 1 completed"));
    // More business logic
} catch (Exception e) {
    observation.error(e);   // record the error
    throw e;
} finally {
    observation.stop();     // must always be called
}

Custom Observation Convention

// Define the observation convention
public class OrderObservationConvention implements GlobalObservationConvention<OrderObservationContext> {

    // Required: tells Micrometer which contexts this convention applies to
    @Override
    public boolean supportsContext(Observation.Context context) {
        return context instanceof OrderObservationContext;
    }

    @Override
    public String getName() {
        return "order.process";
    }

    @Override
    public String getContextualName(OrderObservationContext context) {
        return "order-" + context.getOrderType();
    }

    @Override
    public KeyValues getLowCardinalityKeyValues(OrderObservationContext context) {
        return KeyValues.of("order.type", context.getOrderType());
    }
}

// Custom observation context
public class OrderObservationContext extends Observation.Context {
    private String orderType;

    public String getOrderType() {
        return orderType;
    }

    public void setOrderType(String orderType) {
        this.orderType = orderType;
    }
}

Distributed Tracing

Distributed tracing follows a request along its full call chain in a microservice architecture, which helps locate performance bottlenecks and failures.

How Tracing Works

┌──────────────────────────────────────────────────────────────────────────┐
│                      How distributed tracing works                       │
├──────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  Service A ──────────────> Service B ─────────────> Service C            │
│                                                                          │
│  TraceId: abc123           TraceId: abc123          TraceId: abc123      │
│  SpanId: span1             SpanId: span2            SpanId: span3        │
│  ParentSpanId: -           ParentSpanId: span1      ParentSpanId: span2  │
│                                                                          │
│  TraceId: unique ID of the whole request, propagated to every service    │
│  SpanId: ID of one unit of work; each service creates a new SpanId       │
│  ParentSpanId: ID of the parent span, used to build the call tree        │
│                                                                          │
└──────────────────────────────────────────────────────────────────────────┘
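The propagation in the diagram can be made concrete with the W3C Trace Context `traceparent` header, whose format is `version-traceId-parentSpanId-flags` (2, 32, 16, and 2 hex characters). The sketch below shows how a downstream service might derive its own span from an incoming header: same trace ID, a fresh span ID, and the caller's span ID as parent. It is a simplified model in plain Java, not Micrometer Tracing's propagator; `TraceContextSketch` is a hypothetical name, and validation is reduced to length checks.

```java
import java.util.concurrent.ThreadLocalRandom;

/** Hypothetical, simplified trace context for one span (requires Java 16+ for records). */
record TraceContextSketch(String traceId, String spanId, String parentSpanId, String flags) {

    /** Builds the context of a new child span from an incoming traceparent header. */
    static TraceContextSketch fromTraceparent(String header) {
        String[] parts = header.split("-");
        if (parts.length != 4 || parts[1].length() != 32 || parts[2].length() != 16) {
            throw new IllegalArgumentException("Malformed traceparent: " + header);
        }
        // Keep the trace ID, create a new span ID, remember the caller's span as parent
        return new TraceContextSketch(parts[1], newSpanId(), parts[2], parts[3]);
    }

    /** Renders the header to send to the next service; this span becomes the parent there. */
    String toTraceparent() {
        return "00-" + traceId + "-" + spanId + "-" + flags;
    }

    private static String newSpanId() {
        // 64 random bits rendered as 16 lowercase hex characters
        return String.format("%016x", ThreadLocalRandom.current().nextLong());
    }

    public static void main(String[] args) {
        TraceContextSketch ctx = fromTraceparent(
                "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01");
        System.out.println(ctx.toTraceparent());
    }
}
```

In a real application the auto-configured HTTP clients and server instrumentation do this for you; the sketch only shows which fields change at each hop.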

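Adding Tracing Dependencies

Note: the spring-boot-starter-opentelemetry and spring-boot-starter-zipkin starters below are Spring Boot 4.0 artifacts. Spring Boot 3.x projects typically declare micrometer-tracing-bridge-otel with opentelemetry-exporter-otlp, or micrometer-tracing-bridge-brave with zipkin-reporter-brave, instead.

OpenTelemetry + OTLP (recommended)

<dependency>
    <groupId>org.springframework.boot</groupId>
    <artifactId>spring-boot-starter-opentelemetry</artifactId>
</dependency>

OpenTelemetry + Zipkin

<dependency>
    <groupId>io.micrometer</groupId>
    <artifactId>micrometer-tracing-bridge-otel</artifactId>
</dependency>
<dependency>
    <groupId>io.opentelemetry</groupId>
    <artifactId>opentelemetry-exporter-zipkin</artifactId>
</dependency>

Brave + Zipkin

<dependency>
    <groupId>org.springframework.boot</groupId>
    <artifactId>spring-boot-starter-zipkin</artifactId>
</dependency>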

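Tracing Configuration

# Trace sampling
management:
  tracing:
    sampling:
      probability: 1.0   # sample 100% of requests (0.1 or lower is recommended in production)

    # Baggage
    baggage:
      remote-fields: userId,tenantId   # baggage propagated across services
      correlation:
        fields: userId,tenantId        # baggage copied into the logging MDC

  # OpenTelemetry OTLP export
  # (on Spring Boot 3.x the equivalent keys are management.otlp.tracing.*, as used later in this chapter)
  opentelemetry:
    tracing:
      export:
        enabled: true
        otlp:
          endpoint: http://localhost:4318/v1/traces
          transport: http   # or grpc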
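A sampling probability of 0.1 means roughly one trace in ten is recorded. One common way to implement this is ratio-based sampling on the trace ID, so that every service reaches the same decision for the same trace. The sketch below models that approach in plain Java under stated assumptions: it is patterned on ratio-based samplers in general, it is not the sampler Spring Boot configures, and `RatioSamplerSketch` is a hypothetical name.

```java
final class RatioSamplerSketch {

    private final double probability;
    private final long upperBound;

    RatioSamplerSketch(double probability) {
        if (probability < 0.0 || probability > 1.0) {
            throw new IllegalArgumentException("probability must be between 0.0 and 1.0");
        }
        this.probability = probability;
        // Trace IDs whose low 63 bits fall below this bound are sampled
        this.upperBound = (long) (probability * Long.MAX_VALUE);
    }

    /** Decides deterministically from the last 16 hex characters of the trace ID. */
    boolean isSampled(String traceId) {
        if (probability >= 1.0) {
            return true;
        }
        if (probability <= 0.0) {
            return false;
        }
        String lowHex = traceId.substring(traceId.length() - 16);
        long low = Long.parseUnsignedLong(lowHex, 16);
        // Clear the sign bit so the value lies in [0, Long.MAX_VALUE]
        return (low & Long.MAX_VALUE) < upperBound;
    }

    public static void main(String[] args) {
        RatioSamplerSketch sampler = new RatioSamplerSketch(0.1);
        System.out.println(sampler.isSampled("4bf92f3577b34da6a3ce929d0e0e4736"));
    }
}
```

Because the decision depends only on the trace ID, a trace is either recorded end to end or not at all, which keeps sampled call chains complete.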

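Log Correlation IDs

Trace IDs are added to log output automatically, which makes it easy to correlate logs with traces:

logging:
  pattern:
    correlation: "[%X{traceId:-},%X{spanId:-}] "

Sample Log Output

2024-01-15 10:30:00.000 [abc123,span1] INFO  c.e.UserService - User logged in successfully

Trace Propagation

When you use the auto-configured HTTP client builders, trace context is propagated automatically: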

@Service
public class RemoteService {

    private final RestClient restClient;

    // Recommended: inject the auto-configured builder
    public RemoteService(RestClient.Builder restClientBuilder) {
        this.restClient = restClientBuilder
                .baseUrl("http://remote-service")
                .build();
    }

    public User getUser(Long id) {
        // Trace context is propagated to the remote service automatically
        return restClient.get()
                .uri("/users/{id}", id)
                .retrieve()
                .body(User.class);
    }
}

Note: if you create a RestTemplate, RestClient, or WebClient directly instead of using the auto-configured builder, trace context is not propagated automatically.

Creating a Custom Span

import io.micrometer.tracing.Tracer;
import io.micrometer.tracing.Span;
import org.springframework.stereotype.Service;

@Service
public class PaymentService {

    private final Tracer tracer;

    public PaymentService(Tracer tracer) {
        this.tracer = tracer;
    }

    public void processPayment(PaymentRequest request) {
        // Create a new span as a child of the current one
        Span span = tracer.nextSpan().name("payment.process");

        // Start the span and put it in scope for the duration of the block
        try (Tracer.SpanInScope ws = tracer.withSpan(span.start())) {
            span.tag("payment.method", request.getMethod());
            span.event("payment.started");

            // Business logic (doPayment is application-specific)
            doPayment(request);

            span.event("payment.completed");
        } catch (Exception e) {
            span.error(e);   // record the exception on the span
            throw e;
        } finally {
            span.end();
        }
    }
}

Using Baggage

Baggage carries contextual information along the trace:

import io.micrometer.tracing.Baggage;
import io.micrometer.tracing.BaggageInScope;
import io.micrometer.tracing.Tracer;
import org.springframework.stereotype.Service;

@Service
public class TenantService {

    private final Tracer tracer;

    public TenantService(Tracer tracer) {
        this.tracer = tracer;
    }

    public void processWithTenant(String tenantId) {
        // Create baggage; fields listed in remote-fields are propagated downstream
        try (BaggageInScope baggage = tracer.createBaggageInScope("tenantId", tenantId)) {
            // Within this scope, tenantId travels with every downstream call
            doSomething();
        }
    }

    public String getCurrentTenant() {
        // Read the current baggage; it may be absent outside a baggage scope
        Baggage baggage = tracer.getBaggage("tenantId");
        return baggage != null ? baggage.get() : null;
    }
}

Zipkin Integration

Starting Zipkin

# Start Zipkin with Docker
docker run -d -p 9411:9411 openzipkin/zipkin

# Or download the JAR and run it directly
curl -sSL https://zipkin.io/quickstart.sh | bash -s
java -jar zipkin.jar

Configuring Spring Boot

spring:
  application:
    name: my-service

management:
  tracing:
    sampling:
      probability: 1.0
  zipkin:
    tracing:
      endpoint: http://localhost:9411/api/v2/spans

Opening the Zipkin UI

Open http://localhost:9411 to view trace data:

  • Service dependency graph
  • Request call chains
  • Time spent in each span

Jaeger Integration

Starting Jaeger

docker run -d --name jaeger \
  -p 16686:16686 \
  -p 4317:4317 \
  -p 4318:4318 \
  jaegertracing/all-in-one:latest

Configuring Spring Boot

management:
  tracing:
    sampling:
      probability: 1.0
  otlp:
    tracing:
      endpoint: http://localhost:4318/v1/traces
      transport: http

Opening the Jaeger UI

Open http://localhost:16686 to view trace data.

Grafana LGTM Integration

LGTM (Loki + Grafana + Tempo + Mimir) is a complete observability stack:

# Run the Grafana LGTM container
docker run -d --name lgtm \
  -p 3000:3000 \
  -p 4317:4317 \
  -p 4318:4318 \
  grafana/otel-lgtm

Configure Spring Boot:

management:
  tracing:
    sampling:
      probability: 1.0
  otlp:
    tracing:
      endpoint: http://localhost:4317
      transport: grpc
    logging:
      endpoint: http://localhost:4317
      transport: grpc   # port 4317 is the gRPC port, so logging must use gRPC as well

Open Grafana (http://localhost:3000) to view:

  • Logs
  • Metrics
  • Traces
  • Correlated analysis across all three

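OTLP Log Export (Spring Boot 3.4+)

Spring Boot 3.4 added auto-configuration for exporting logs over OTLP, so logs can be sent to an OpenTelemetry Collector. In 3.4 the application must still bridge its logging framework to OpenTelemetry, for example with an OpenTelemetry Logback or Log4j appender.

management:
  otlp:
    logging:
      endpoint: http://localhost:4318/v1/logs   # OTLP log endpoint
      transport: http                           # use HTTP transport
      # transport: grpc                         # or use gRPC transport
      connect-timeout: 5s                       # connection timeout
    tracing:
      transport: grpc                           # traces use gRPC
      connect-timeout: 5s

Enabling or Disabling Log Export

management:
  otlp:
    logging:
      export:
        enabled: true   # enable log export (default: true)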

Complete OTLP Configuration Example

management:
  tracing:
    sampling:
      probability: 1.0

  otlp:
    # Logs
    logging:
      endpoint: http://otel-collector:4318/v1/logs
      transport: http
      connect-timeout: 5s
      export:
        enabled: true

    # Traces
    tracing:
      endpoint: http://otel-collector:4318/v1/traces
      transport: http
      connect-timeout: 5s
      export:
        enabled: true

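OpenTelemetry Collector Configuration Example

# otel-collector-config.yaml
receivers:
  otlp:
    protocols:
      http:
        endpoint: 0.0.0.0:4318
      grpc:
        endpoint: 0.0.0.0:4317

exporters:
  # Loki 3.x ingests OTLP natively; the dedicated loki exporter is no longer in recent collector releases
  otlphttp/loki:
    endpoint: http://loki:3100/otlp
  # "tempo" is not an exporter type; use the otlp exporter with a name suffix
  otlp/tempo:
    endpoint: tempo:4317
    tls:
      insecure: true

service:
  pipelines:
    logs:
      receivers: [otlp]
      exporters: [otlphttp/loki]
    traces:
      receivers: [otlp]
      exporters: [otlp/tempo]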

Complete Docker Compose Example

version: '3.8'
services:
  app:
    build: .
    environment:
      - MANAGEMENT_OTLP_LOGGING_ENDPOINT=http://otel-collector:4318/v1/logs
      - MANAGEMENT_OTLP_TRACING_ENDPOINT=http://otel-collector:4318/v1/traces
    depends_on:
      - otel-collector

  otel-collector:
    image: otel/opentelemetry-collector-contrib:latest
    command: ["--config=/etc/otel-collector-config.yaml"]
    volumes:
      - ./otel-collector-config.yaml:/etc/otel-collector-config.yaml
    ports:
      - "4317:4317"
      - "4318:4318"

  loki:
    image: grafana/loki:latest
    ports:
      - "3100:3100"

  tempo:
    image: grafana/tempo:latest
    ports:
      - "3200:3200"

  grafana:
    image: grafana/grafana:latest
    ports:
      - "3000:3000"
    environment:
      - GF_AUTH_ANONYMOUS_ENABLED=true

Testing Tracing

Tracing components that report data are not auto-configured in tests. To exercise observations in a test:

@SpringBootTest
@Import(TestObservationConfig.class)
class TracingTest {

    @Autowired
    private ObservationRegistry observationRegistry;

    @Test
    void testObservation() {
        // Uses the ObservationRegistry from the test configuration
        Observation.createNotStarted("test.observation", observationRegistry)
                .observe(() -> {
                    // Test logic
                });
    }
}

@Configuration
class TestObservationConfig {

    @Bean
    ObservationRegistry observationRegistry() {
        ObservationRegistry registry = ObservationRegistry.create();
        // ObservationTextPublisher (io.micrometer.observation) logs each observation event as text
        registry.observationConfig().observationHandler(new ObservationTextPublisher());
        return registry;
    }
}

Observability Best Practices

1. Name Observations Consistently

// Recommended: dot-separated hierarchical names
Observation.createNotStarted("order.create", observationRegistry)
Observation.createNotStarted("order.payment.process", observationRegistry)
Observation.createNotStarted("db.query.users.findById", observationRegistry)

// Not recommended: ad hoc names
Observation.createNotStarted("创建订单", observationRegistry)
Observation.createNotStarted("doSomething", observationRegistry)

2. Use Tags Carefully

// Recommended: low-cardinality tags
.lowCardinalityKeyValue("status", "success")   // bounded values: success, failed
.lowCardinalityKeyValue("method", "credit")    // bounded values: credit, debit

// Not recommended: a high-cardinality value used as a metric tag
.lowCardinalityKeyValue("orderId", "12345")    // unbounded values cause a metric explosion

// Correct: high-cardinality values go to traces only
.highCardinalityKeyValue("orderId", "12345")   // used in traces only

3. Sampling Rate

# Development: sample everything
management:
  tracing:
    sampling:
      probability: 1.0

# Production: low sampling rate
management:
  tracing:
    sampling:
      probability: 0.1   # sample 10%

4. Handling Sensitive Information

// Never record sensitive information in traces
// Wrong
.highCardinalityKeyValue("password", request.getPassword())

// Correct
.highCardinalityKeyValue("hasPassword", String.valueOf(request.getPassword() != null))
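Beyond avoiding individual fields by hand, it can help to pass key-values through a single redaction step before they are attached to an observation. The sketch below is one possible plain-Java helper, assuming a fixed list of sensitive key fragments; `KeyValueRedactorSketch`, the fragment list, and the mask string are hypothetical choices rather than a Micrometer feature.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

final class KeyValueRedactorSketch {

    // Assumed list of key fragments that mark a value as sensitive
    private static final Set<String> SENSITIVE_FRAGMENTS =
            Set.of("password", "secret", "token", "authorization");

    private static final String MASK = "******";

    /** Returns a copy of the map with values of sensitive-looking keys masked. */
    static Map<String, String> redact(Map<String, String> keyValues) {
        Map<String, String> redacted = new LinkedHashMap<>();
        keyValues.forEach((key, value) ->
                redacted.put(key, isSensitive(key) ? MASK : value));
        return redacted;
    }

    /** Case-insensitive check for any sensitive fragment inside the key. */
    static boolean isSensitive(String key) {
        String lower = key.toLowerCase(Locale.ROOT);
        return SENSITIVE_FRAGMENTS.stream().anyMatch(lower::contains);
    }

    public static void main(String[] args) {
        Map<String, String> raw = new LinkedHashMap<>();
        raw.put("orderId", "12345");
        raw.put("userPassword", "hunter2");
        System.out.println(redact(raw));
    }
}
```

Matching on key names is only a first line of defense; values that embed secrets under innocuous keys still need to be kept out of traces at the source.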

5. Tracing Exceptions

Observation observation = Observation.createNotStarted("my.operation", observationRegistry)
        .start();

try {
    // Business logic
} catch (BusinessException e) {
    // Record a business exception (BusinessException is application-specific)
    observation.lowCardinalityKeyValue("error.type", "business");
    observation.highCardinalityKeyValue("error.code", e.getCode());
    throw e;
} catch (Exception e) {
    // Record a system exception
    observation.error(e);   // attaches the exception to the observation
    throw e;
} finally {
    observation.stop();
}

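Summary

In this chapter we covered:

  1. Actuator overview: the built-in endpoints
  2. Endpoint configuration: exposure, security, and path settings
  3. Health checks: custom health indicators and Kubernetes probes
  4. Metrics: built-in and custom metrics
  5. Prometheus integration: connecting to a monitoring system
  6. Observability: the three pillars and the Micrometer Observation API
  7. Distributed tracing: how tracing works, plus OpenTelemetry, Zipkin, and Jaeger integration
  8. Custom endpoints: extending monitoring capabilities
  9. Best practices: security, performance, and alerting

Exercises

  1. Configure the health endpoint to show the database connection status
  2. Create a custom health indicator
  3. Add custom business metrics and observations
  4. Integrate Prometheus and Grafana
  5. Configure distributed tracing and inspect the call chain
  6. Configure Kubernetes probes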
