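Monitoring and Operations

A complete monitoring setup is key to keeping a RocketMQ cluster stable. This chapter describes how to build a monitoring and operations stack for RocketMQ.

Monitoring Architecture Overview

The RocketMQ monitoring stack covered in this chapter consists of the following parts: RocketMQ Exporter exposes cluster metrics, Prometheus scrapes and stores them, Grafana visualizes them, and AlertManager routes alert notifications.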

RocketMQ Exporter

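RocketMQ Exporter is the officially provided tool for exporting monitoring metrics. It retrieves metric data from the Brokers through MQAdmin and converts it to the Prometheus format.

Deploying the Exporter

Option 1: Docker deployment

# Pull the image
docker pull apache/rocketmq-exporter:latest

# Start the container
docker run -d \
  --name rocketmq-exporter \
  -p 5557:5557 \
  -e ROCKETMQ_CONFIG_NAMESRVADDR="192.168.1.1:9876;192.168.1.2:9876" \
  apache/rocketmq-exporter:latest

Option 2: Build from source

# Clone the source
git clone https://github.com/apache/rocketmq-exporter.git
cd rocketmq-exporter

# Build the package
mvn clean package -Dmaven.test.skip=true

# Start the Exporter (adjust the jar name to the version that was built)
java -jar target/rocketmq-exporter-0.0.2-SNAPSHOT.jar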

Configuration

Core settings in application.yml:

server:
  port: 5557 # Exporter listening port

rocketmq:
  config:
    namesrvAddr: 192.168.1.1:9876;192.168.1.2:9876 # NameServer addresses
    webTelemetryPath: /metrics # metrics path
    enableACL: false # whether ACL is enabled
    accessKey: "" # ACL AccessKey
    secretKey: "" # ACL SecretKey
    outOfTimeSeconds: 60 # metric cache expiry, in seconds

# Scheduled task settings
task:
  collectTopicOffset:
    cron: "15 0/1 * * * ?" # collect Topic offsets every minute
  collectConsumerOffset:
    cron: "15 0/1 * * * ?" # collect Consumer offsets every minute
  collectBrokerStatsTopic:
    cron: "15 0/1 * * * ?" # collect Broker stats every minute

Verifying the Deployment

# Query the metrics endpoint
curl http://localhost:5557/metrics

# Sample output
rocketmq_broker_tps{cluster="DefaultCluster",broker="broker-a"} 1234.0
rocketmq_group_diff{group="ConsumerGroupA",topic="TopicTest"} 567.0
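
To check the endpoint from a script rather than by eye, the text output can be parsed with a few lines of Python. This is a minimal sketch that works on the sample output shown above. `parse_metrics` is a hypothetical helper, not part of the Exporter, and it only handles simple `name{labels} value` lines.

```python
import re

# Matches lines such as: name{k="v",k2="v2"} 12.5  (the label block is optional)
LINE_RE = re.compile(
    r'^(?P<name>[A-Za-z_:][A-Za-z0-9_:]*)(?:\{(?P<labels>.*)\})?\s+(?P<value>\S+)$'
)
LABEL_RE = re.compile(r'(\w+)="([^"]*)"')


def parse_metrics(text):
    """Parse Prometheus text-format lines into (name, labels, value) tuples.

    Hypothetical helper for illustration. Comment and blank lines are skipped.
    Lines with trailing timestamps or escaped quotes inside label values are
    not supported and are skipped or mis-parsed.
    """
    samples = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        match = LINE_RE.match(line)
        if match is None:
            continue
        labels = dict(LABEL_RE.findall(match.group("labels") or ""))
        samples.append((match.group("name"), labels, float(match.group("value"))))
    return samples


SAMPLE = '''
rocketmq_broker_tps{cluster="DefaultCluster",broker="broker-a"} 1234.0
rocketmq_group_diff{group="ConsumerGroupA",topic="TopicTest"} 567.0
'''

for name, labels, value in parse_metrics(SAMPLE):
    print(name, labels, value)
```

In practice the text would come from an HTTP request to the `/metrics` endpoint instead of the `SAMPLE` string.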

Prometheus Configuration

Installing Prometheus

# Download
wget https://github.com/prometheus/prometheus/releases/download/v2.45.0/prometheus-2.45.0.linux-amd64.tar.gz

# Extract
tar -xzf prometheus-2.45.0.linux-amd64.tar.gz
cd prometheus-2.45.0.linux-amd64

Configuring Prometheus

The prometheus.yml configuration file:

global:
  scrape_interval: 15s # scrape interval
  evaluation_interval: 15s # rule evaluation interval

scrape_configs:
  # RocketMQ Exporter
  - job_name: 'rocketmq'
    static_configs:
      - targets: ['192.168.1.10:5557']
        labels:
          instance: 'rocketmq-cluster'

  # If there are multiple Exporters (list each target once;
  # the same target under two jobs is scraped twice)
  - job_name: 'rocketmq-exporters'
    static_configs:
      - targets:
          - '192.168.1.10:5557'
          - '192.168.1.11:5557'

# Alert rule files
rule_files:
  - "alert_rules.yml"

# AlertManager settings
alerting:
  alertmanagers:
    - static_configs:
        - targets: ['localhost:9093']

Starting Prometheus

./prometheus --config.file=prometheus.yml

Prometheus Web UI: http://localhost:9090
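
In the sample above, 192.168.1.10:5557 appears under both jobs, so its metrics would be stored twice with different `job` labels if both jobs were kept. The following is a minimal sketch of a check for that situation. It works on a plain Python dict that mirrors the `scrape_configs` section, and `duplicate_targets` is a hypothetical helper, not a Prometheus tool.

```python
from collections import defaultdict


def duplicate_targets(scrape_configs):
    """Return {target: [job names]} for targets listed under more than one job."""
    jobs_by_target = defaultdict(list)
    for job in scrape_configs:
        for static in job.get("static_configs", []):
            for target in static.get("targets", []):
                jobs_by_target[target].append(job["job_name"])
    return {t: jobs for t, jobs in jobs_by_target.items() if len(jobs) > 1}


# Hand-written mirror of the scrape_configs section shown above.
SCRAPE_CONFIGS = [
    {"job_name": "rocketmq",
     "static_configs": [{"targets": ["192.168.1.10:5557"]}]},
    {"job_name": "rocketmq-exporters",
     "static_configs": [{"targets": ["192.168.1.10:5557", "192.168.1.11:5557"]}]},
]

print(duplicate_targets(SCRAPE_CONFIGS))
```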

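Core Monitoring Metrics

RocketMQ exposes a rich set of monitoring metrics. The core metrics to watch are listed below.

Broker Metrics

| Metric | Meaning | Suggested alert threshold |
| --- | --- | --- |
| rocketmq_broker_tps | Broker produce TPS | - |
| rocketmq_broker_qps | Broker consume QPS | - |
| rocketmq_brokeruntime_commitlog_disk_ratio | CommitLog disk usage ratio | Alert above 80% |
| rocketmq_brokeruntime_putmessage_entire_time_max | Maximum message write latency | Alert above 100 ms |
| rocketmq_brokeruntime_dispatch_behind_bytes | Bytes not yet dispatched | Alert above 100 MB |

Consumer Metrics

| Metric | Meaning | Suggested alert threshold |
| --- | --- | --- |
| rocketmq_group_diff | Consumer group message backlog | Alert above 10000 |
| rocketmq_group_retrydiff | Retry queue backlog | Alert above 1000 |
| rocketmq_group_dlqdiff | Dead-letter queue backlog | Alert above 100 |
| rocketmq_client_consume_fail_msg_tps | Consumption failure TPS | Alert above 0 |
| rocketmq_client_consume_rt | Consumption latency | Alert above 1000 ms |

Producer Metrics

| Metric | Meaning | Suggested alert threshold |
| --- | --- | --- |
| rocketmq_producer_tps | Topic produce TPS | - |
| rocketmq_producer_message_size | Produced message size | - |
| rocketmq_producer_offset | Topic max offset | - |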
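
A backlog figure such as `rocketmq_group_diff` is commonly understood as the gap between each queue's latest Broker offset and the group's committed offset, summed over all queues. The sketch below illustrates that arithmetic with made-up offsets. Whether the Exporter computes the metric in exactly this way is an assumption here, and the field names are hypothetical.

```python
def total_backlog(queues):
    """Sum (broker offset - consumer offset) over all queues."""
    return sum(q["broker_offset"] - q["consumer_offset"] for q in queues)


# Made-up per-queue offsets for one consumer group on one Topic.
QUEUES = [
    {"queue_id": 0, "broker_offset": 1500, "consumer_offset": 1200},
    {"queue_id": 1, "broker_offset": 1480, "consumer_offset": 1480},
    {"queue_id": 2, "broker_offset": 1510, "consumer_offset": 1243},
]

print(total_backlog(QUEUES))  # 300 + 0 + 267 = 567
```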
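
The suggested thresholds in the tables above can be kept in one place and checked programmatically. This is a minimal sketch with the following assumptions: the disk ratio is a 0 to 1 fraction, latencies are in milliseconds, and 100 MB is taken as 100 * 1024 * 1024 bytes. `find_breaches` is a hypothetical helper.

```python
# Suggested thresholds from the tables above, in each metric's own unit.
THRESHOLDS = {
    "rocketmq_brokeruntime_commitlog_disk_ratio": 0.8,
    "rocketmq_brokeruntime_putmessage_entire_time_max": 100,
    "rocketmq_brokeruntime_dispatch_behind_bytes": 100 * 1024 * 1024,
    "rocketmq_group_diff": 10000,
    "rocketmq_group_retrydiff": 1000,
    "rocketmq_group_dlqdiff": 100,
    "rocketmq_client_consume_fail_msg_tps": 0,
    "rocketmq_client_consume_rt": 1000,
}


def find_breaches(samples):
    """Return the (metric, value) pairs whose value exceeds its threshold.

    Metrics without a suggested threshold are ignored.
    """
    return [
        (name, value)
        for name, value in samples
        if name in THRESHOLDS and value > THRESHOLDS[name]
    ]


# Made-up readings for illustration.
READINGS = [
    ("rocketmq_broker_tps", 1234.0),
    ("rocketmq_group_diff", 567.0),
    ("rocketmq_brokeruntime_commitlog_disk_ratio", 0.86),
    ("rocketmq_group_dlqdiff", 150.0),
]

print(find_breaches(READINGS))
```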

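Grafana Visualization

Installing Grafana

# Using Docker
docker run -d \
  --name grafana \
  -p 3000:3000 \
  grafana/grafana:latest

Configuring the Data Source

  1. Open Grafana: http://localhost:3000
  2. Default credentials: admin/admin
  3. Add a Prometheus data source:
    • URL: http://prometheus:9090 (the hostname prometheus resolves only if Grafana and Prometheus share a Docker network; otherwise use the address at which Prometheus is reachable)
    • Access mode: Server

Importing a Dashboard

RocketMQ officially provides a Grafana Dashboard template:

# Dashboard ID: 10477
# To import: Grafana -> Dashboards -> Import -> enter 10477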
Custom Dashboard Panels

Broker Overview Panel

{
  "title": "Broker TPS/QPS",
  "type": "graph",
  "targets": [
    {
      "expr": "sum(rocketmq_broker_tps) by (broker)",
      "legendFormat": "{{broker}} TPS"
    },
    {
      "expr": "sum(rocketmq_broker_qps) by (broker)",
      "legendFormat": "{{broker}} QPS"
    }
  ]
}

Message Backlog Panel

{
  "title": "Message Backlog Monitoring",
  "type": "graph",
  "targets": [
    {
      "expr": "rocketmq_group_diff",
      "legendFormat": "{{group}} - {{topic}}"
    }
  ],
  "alert": {
    "conditions": [
      {
        "evaluator": {
          "type": "gt",
          "params": [10000]
        }
      }
    ]
  }
}
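
When several panels follow the same pattern, the JSON can be generated instead of written by hand. The sketch below builds the same structure as the Broker overview panel above. `rate_panel` is a hypothetical helper, and the output contains only the fields shown in this chapter, not a complete Grafana dashboard.

```python
import json


def rate_panel(title, series):
    """Build a graph panel with one target per (metric, suffix) pair."""
    targets = [
        {
            "expr": f"sum({metric}) by (broker)",
            # Plain string, so the double braces reach Grafana unchanged.
            "legendFormat": "{{broker}} " + suffix,
        }
        for metric, suffix in series
    ]
    return {"title": title, "type": "graph", "targets": targets}


panel = rate_panel(
    "Broker TPS/QPS",
    [("rocketmq_broker_tps", "TPS"), ("rocketmq_broker_qps", "QPS")],
)
print(json.dumps(panel, indent=2))
```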

Alert Rule Configuration

Prometheus Alert Rules

The alert_rules.yml file:

groups:
  - name: rocketmq-alerts
    rules:
      # Message backlog alert
      - alert: RocketMQMessageAccumulation
        expr: rocketmq_group_diff > 10000
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Message backlog alert"
          description: "Consumer group {{ $labels.group }} has a backlog of {{ $value }} messages on Topic {{ $labels.topic }}"

      # Disk usage alert
      - alert: RocketMQDiskUsageHigh
        expr: rocketmq_brokeruntime_commitlog_disk_ratio > 0.8
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Broker disk usage too high"
          description: "Broker {{ $labels.broker }} disk usage is {{ $value | humanizePercentage }}"

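      # Consumption failure alert
      # (the metric is already a per-second gauge, so rate() is not applied)
      - alert: RocketMQConsumeFailed
        expr: sum(rocketmq_client_consume_fail_msg_tps) by (group) > 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Message consumption failed"
          description: "Consumer group {{ $labels.group }} has consumption failures, failure TPS: {{ $value }}"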

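      # Broker unavailable alert
      # Note: up reports whether the scrape target (the Exporter) is reachable,
      # so this fires when Broker metrics can no longer be collected
      - alert: RocketMQBrokerDown
        expr: up{job="rocketmq"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Broker unavailable"
          description: "Scrape target {{ $labels.instance }} has been down for more than 1 minute"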

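      # Master-slave replication lag alert
      # (check that your Exporter version exposes this metric)
      - alert: RocketMQReplicationLag
        expr: rocketmq_broker_commitlog_diff > 104857600
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Master-slave replication lag too high"
          description: "Broker master-slave replication lag is {{ $value | humanize1024 }}B"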
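
The `for` clause in each rule means the expression must stay true across consecutive evaluations for the whole duration before the alert fires; until then the alert is pending, and a single false evaluation resets it. The following is a simplified model of that behaviour for illustration, not Prometheus's actual implementation. `alert_states` is a hypothetical helper.

```python
def alert_states(values, threshold, for_seconds, interval_seconds):
    """Return 'inactive', 'pending', or 'firing' for each evaluation.

    values: one sample per rule evaluation, interval_seconds apart.
    The alert fires once the condition has held continuously for at
    least for_seconds; any false evaluation resets the timer.
    """
    held_for = None  # seconds the condition has held so far, None if inactive
    states = []
    for value in values:
        if value > threshold:
            held_for = 0 if held_for is None else held_for + interval_seconds
            states.append("firing" if held_for >= for_seconds else "pending")
        else:
            held_for = None
            states.append("inactive")
    return states


# Backlog above 10000 with `for: 5m`, evaluated once per minute.
print(alert_states([12000] * 6, 10000, 300, 60))
# A dip below the threshold resets the pending timer.
print(alert_states([12000, 9000, 12000], 10000, 300, 60))
```

With a one-minute evaluation interval, the first call yields five pending states followed by firing, which is why a short spike in backlog does not page anyone.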

AlertManager Configuration

The alertmanager.yml file:

global:
  resolve_timeout: 5m
  smtp_smarthost: 'smtp.example.com:25'
  smtp_from: '[email protected]'
  smtp_auth_username: '[email protected]'
  smtp_auth_password: 'password'

route:
  group_by: ['alertname', 'severity']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 1h
  receiver: 'default-receiver'
  routes:
    - match:
        severity: critical
      receiver: 'critical-receiver'
    - match:
        severity: warning
      receiver: 'warning-receiver'

receivers:
  - name: 'default-receiver'
    email_configs:
      - to: '[email protected]'

  - name: 'critical-receiver'
    email_configs:
      - to: '[email protected]'
    webhook_configs:
      - url: 'http://webhook.example.com/alert'

  - name: 'warning-receiver'
    email_configs:
      - to: '[email protected]'

Starting AlertManager

./alertmanager --config.file=alertmanager.yml
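
The routing tree above sends an alert to the first child route whose `match` labels all equal the alert's labels, and falls back to the top-level receiver when none match. The sketch below models only that selection step, assuming no route sets `continue: true`. Grouping and timing are left out, and `pick_receiver` is a hypothetical helper.

```python
# Child routes in the order they appear in alertmanager.yml.
ROUTES = [
    ({"severity": "critical"}, "critical-receiver"),
    ({"severity": "warning"}, "warning-receiver"),
]
DEFAULT_RECEIVER = "default-receiver"


def pick_receiver(alert_labels):
    """Return the receiver of the first child route whose labels all match."""
    for match, receiver in ROUTES:
        if all(alert_labels.get(key) == value for key, value in match.items()):
            return receiver
    return DEFAULT_RECEIVER


print(pick_receiver({"alertname": "RocketMQBrokerDown", "severity": "critical"}))
print(pick_receiver({"alertname": "RocketMQDiskUsageHigh", "severity": "warning"}))
print(pick_receiver({"alertname": "Custom", "severity": "info"}))
```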

Operations

Topic Management

# Create a Topic
sh bin/mqadmin updateTopic -n localhost:9876 -c DefaultCluster -t TopicTest -r 4 -w 4

# List Topics
sh bin/mqadmin topicList -n localhost:9876

# Show Topic routing
sh bin/mqadmin topicRoute -n localhost:9876 -t TopicTest

# Show Topic status
sh bin/mqadmin topicStatus -n localhost:9876 -t TopicTest

# Delete a Topic
sh bin/mqadmin deleteTopic -n localhost:9876 -c DefaultCluster -t TopicTest

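Consumer Group Management

# List consumer groups (consumerProgress without -g prints every group)
sh bin/mqadmin consumerProgress -n localhost:9876

# Show consumption progress for one group
sh bin/mqadmin consumerProgress -n localhost:9876 -g ConsumerGroupA

# Reset the consumer offset to the latest position
sh bin/mqadmin resetOffsetByTime -n localhost:9876 -g ConsumerGroupA -t TopicTest -s -1

# Reset the consumer offset to a specific time (format: yyyy-MM-dd#HH:mm:ss:SSS, or epoch milliseconds)
sh bin/mqadmin resetOffsetByTime -n localhost:9876 -g ConsumerGroupA -t TopicTest -s "2024-01-01#12:00:00:000"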
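
According to the mqadmin help text, the `-s` option of `resetOffsetByTime` accepts `now`, an epoch-millisecond value, or a `yyyy-MM-dd#HH:mm:ss:SSS` string. The sketch below produces the latter two from a Python `datetime`. The helper names are hypothetical. The epoch form avoids time zone ambiguity, because the string form is expected to be read in the time zone of the machine that runs mqadmin.

```python
from datetime import datetime, timezone


def to_epoch_millis(moment):
    """Convert a timezone-aware datetime to epoch milliseconds."""
    return int(moment.timestamp() * 1000)


def to_mqadmin_timestamp(moment):
    """Format a datetime as yyyy-MM-dd#HH:mm:ss:SSS."""
    millis = moment.microsecond // 1000
    return moment.strftime("%Y-%m-%d#%H:%M:%S:") + f"{millis:03d}"


moment = datetime(2024, 1, 1, 12, 0, 0, tzinfo=timezone.utc)
print(to_epoch_millis(moment))       # 1704110400000
print(to_mqadmin_timestamp(moment))  # 2024-01-01#12:00:00:000
```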

Message Queries

# Query messages by Key
sh bin/mqadmin queryMsgByKey -n localhost:9876 -t TopicTest -k messageKey

# Query a message by MessageId
sh bin/mqadmin queryMsgById -n localhost:9876 -i messageId

# Query a message by Offset
sh bin/mqadmin queryMsgByOffset -n localhost:9876 -t TopicTest -b broker-a -i 0 -o 1000

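Cluster Status Checks

# Show cluster information
sh bin/mqadmin clusterList -n localhost:9876

# Show Broker status (-b takes the Broker address as ip:port, shown here with the default port 10911)
sh bin/mqadmin brokerStatus -n localhost:9876 -b 192.168.1.1:10911

# Show all statistics
sh bin/mqadmin statsAll -n localhost:9876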

Dashboard Management Console

RocketMQ Dashboard is the officially provided web management console for managing the cluster visually.

Docker Deployment

docker run -d \
  --name rocketmq-dashboard \
  -p 8180:8080 \
  -e "JAVA_OPTS=-Drocketmq.namesrv.addr=192.168.1.1:9876" \
  apacherocketmq/rocketmq-dashboard:latest

Feature Overview

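Operations Best Practices

1. Tiered Monitoring Metrics

| Level | Metrics | Alert channel |
| --- | --- | --- |
| P0 | Broker down, disk full | Phone call + SMS |
| P1 | Severe message backlog, master-slave disconnection | SMS + email |
| P2 | High disk usage, consumption delay | Email |
| P3 | Performance degradation, error logs | Notification |

2. Routine Inspection Checklist

  • Check whether disk usage exceeds 70%
  • Check whether the message backlog is normal
  • Check whether master-slave replication is normal
  • Check the logs for errors and exceptions
  • Check JVM memory usage
  • Check network connection status

3. Troubleshooting Workflow

4. Log Management

# Log directory
~/logs/rocketmqlogs/

# Key log files
broker.log # Broker runtime log
namesrv.log # NameServer runtime log
store.log # storage-related log
stats.log # statistics log

# Log level configuration
# Edit conf/logback_broker.xml
# Edit conf/logback_namesrv.xml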
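
The tiers in the table above can be expressed as a lookup so that alerting glue code picks its channels consistently. This is a minimal sketch: the event keys and channel names are hypothetical labels chosen for illustration, and only the tier-to-channel pairing comes from the table.

```python
# Event -> level, following the table above (event keys are hypothetical).
EVENT_LEVELS = {
    "broker_down": "P0",
    "disk_full": "P0",
    "severe_backlog": "P1",
    "replication_disconnected": "P1",
    "disk_usage_high": "P2",
    "consume_delay": "P2",
    "performance_degraded": "P3",
    "error_logs": "P3",
}

# Level -> notification channels, following the table above.
LEVEL_CHANNELS = {
    "P0": ["phone", "sms"],
    "P1": ["sms", "email"],
    "P2": ["email"],
    "P3": ["notification"],
}


def channels_for(event):
    """Return the notification channels for a known event key."""
    return LEVEL_CHANNELS[EVENT_LEVELS[event]]


print(channels_for("broker_down"))
print(channels_for("consume_delay"))
```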
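
The disk item on the checklist is easy to automate. The sketch below uses Python's standard `shutil.disk_usage` with the 70% limit from the list. Which path to check is an assumption: pass the filesystem that holds the Broker's store directory.

```python
import shutil


def disk_usage_ratio(path):
    """Return used/total for the filesystem that contains path (0.0 to 1.0)."""
    usage = shutil.disk_usage(path)
    return usage.used / usage.total


def exceeds(ratio, limit=0.70):
    """True when the usage ratio is above the inspection limit."""
    return ratio > limit


ratio = disk_usage_ratio(".")  # replace "." with the Broker store path
print(f"disk usage: {ratio:.1%}, over limit: {exceeds(ratio)}")
```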
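
For the routine check of error logs, a small script can count entries per level. This is a minimal sketch that assumes each log line starts with a date, a time, and then the level, which should be confirmed against the pattern in the logback files. The sample lines are made up, and `count_levels` is a hypothetical helper.

```python
from collections import Counter

LEVELS = {"DEBUG", "INFO", "WARN", "ERROR"}


def count_levels(lines):
    """Count log lines per level, taking the third whitespace token as the level."""
    counts = Counter()
    for line in lines:
        parts = line.split(maxsplit=3)
        if len(parts) >= 3 and parts[2] in LEVELS:
            counts[parts[2]] += 1
    return counts


# Made-up lines in the assumed "date time LEVEL thread - message" layout.
SAMPLE_LOG = """\
2024-01-01 12:00:00 INFO main - broker started
2024-01-01 12:00:05 WARN StoreThread - slow flush
2024-01-01 12:00:09 ERROR StoreThread - flush failed
2024-01-01 12:00:10 ERROR StoreThread - flush failed
"""

print(count_levels(SAMPLE_LOG.splitlines()))
```

To scan a real file, pass an open file object for `~/logs/rocketmqlogs/broker.log` (expanded with `os.path.expanduser`) instead of the sample lines.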

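Summary

This chapter covered the RocketMQ monitoring and operations stack:

  1. Exporter deployment: collects Broker, Producer, and Consumer metrics
  2. Prometheus configuration: stores and queries monitoring data
  3. Grafana visualization: displays monitoring dashboards
  4. Alert rules: configure alert thresholds and notification channels
  5. Operations: commands for managing Topics, consumer groups, and messages
  6. Best practices: tiered monitoring, routine inspection, and troubleshooting

A complete monitoring and operations stack is key to keeping a RocketMQ cluster stable.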
