Cluster restart fails

Environment

  • EMQX version: 5.7.0 (open-source edition)
  • Operating system version: Linux version 4.19.90-89.11.v2401.ky10.x86_64 (root@localhost.localdomain) (gcc version 7.3.0 (GCC)) #1 SMP Tue May 7 18:33:01 CST 2024

The config.env configuration file:

export EMQX_NODE__NAME=emqx@192.9.200.35
export EMQX_NODE__COOKIE=emqxsecretcookie-cluster1
export EMQX_CLUSTER__NAME=emqx-cluster1
export EMQX_CLUSTER__STATIC__SEEDS=[emqx@192.168.78.36,emqx@192.168.78.44,emqx@192.168.78.45,emqx@192.9.200.31,emqx@192.9.200.34]
export EMQX_NODE__DATA_DIR=/root/emqx/emqx-data
export EMQX_CLUSTER__DISCOVERY_STRATEGY=static
## Limit the number of Erlang VM processes
export EMQX_NODE__PROCESS_LIMIT=2000000
## Maximum number of ports that can exist simultaneously in the Erlang system
export EMQX_NODE__MAX_PORTS=2097152
export EMQX_LISTENERS__TCP__DEFAULT__ENABLE=true
export EMQX_LISTENERS__TCP__DEFAULT__BIND=0.0.0.0:1883
## Acceptor pool size of the TCP listener
export EMQX_LISTENERS__TCP__DEFAULT__ACCEPTORS=64
## Maximum number of connections for the TCP listener
export EMQX_LISTENERS__TCP__DEFAULT__MAX_CONNECTIONS=1024000
export EMQX_LISTENERS__SSL__DEFAULT__ENABLE=false
export EMQX_LISTENERS__SSL__DEFAULT__BIND=0.0.0.0:8883
export EMQX_LISTENERS__WS__DEFAULT__ENABLE=true
export EMQX_LISTENERS__WS__DEFAULT__BIND=0.0.0.0:8083
## Acceptor pool size of the WS listener
export EMQX_LISTENERS__WS__DEFAULT__ACCEPTORS=64
## Maximum number of connections for the WS listener
export EMQX_LISTENERS__WS__DEFAULT__MAX_CONNECTIONS=1024000
export EMQX_LISTENERS__WSS__DEFAULT__ENABLE=false
export EMQX_LISTENERS__WSS__DEFAULT__BIND=0.0.0.0:8084
export EMQX_DASHBOARD__LISTENERS__HTTP__ENABLE=true
export EMQX_DASHBOARD__LISTENERS__HTTP__BIND=0.0.0.0:18083
export EMQX_DEFAULT_LOG_HANDLER=file
export EMQX_LOG__FILE__ENABLE=true
export EMQX_LOG__FILE__FORMATTER=text
export EMQX_LOG__FILE__LEVEL=info
export EMQX_LOG__FILE__PATH=logs/emqx-log
export EMQX_LOG__FILE__ROTATION_COUNT=30
export EMQX_LOG__FILE__ROTATION_SIZE=30MB
export EMQX_LOG__FILE__TIMESTAMP_FORMAT=rfc3339
## Maximum MQTT packet size
export EMQX_MQTT__MAX_PACKET_SIZE=5MB
## Maximum inflight window size, i.e. the maximum number of unacknowledged messages per client
export EMQX_MQTT__MAX_INFLIGHT=100
## Message queue length
export EMQX_MQTT__MAX_MQUEUE_LEN=10000
## Maximum number of QoS 2 messages awaiting PUBREL that EMQX keeps per session
export EMQX_MQTT__MAX_AWAITING_REL=1000
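
Every override in this file reaches EMQX through an exported environment variable, so a line that is not a well-formed export statement does not take effect when the file is sourced. The following is a minimal sketch of a pre-flight check; check_env_file is a hypothetical helper written for this post, not an EMQX tool.

```shell
#!/bin/sh
# Hypothetical helper (not part of EMQX): list every line of an env file
# that is neither blank, nor a comment, nor of the form "export NAME=value".
# Offending lines are printed with their line number.
# Returns 1 if any such line exists, 0 if the file looks clean.
check_env_file() {
    if grep -vnE '^[[:space:]]*(#.*)?$|^export [A-Za-z_][A-Za-z0-9_]*=' "$1"; then
        return 1
    fi
    return 0
}
```

Running check_env_file config.env before source config.env prints any suspicious lines and returns non-zero, which makes it easy to abort a start script early.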

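Steps to reproduce

  1. Prepare 6 virtual machines.
  2. On each of the 6 virtual machines, do the following:
    2.1. source config.env
    2.2. Run emqx start to start the node.
  3. The cluster forms successfully.
  4. Run emqx stop on the virtual machines one by one to shut the nodes down.
  5. After they are stopped, run emqx start again to start the nodes.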
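
For reference, the per-node part of these steps can be sketched as a few shell functions. This is only an illustration: CONFIG_ENV and the function names are assumptions made here, and the only EMQX commands used are emqx start, emqx stop and emqx ctl cluster status.

```shell
#!/bin/sh
# Sketch of the per-node steps. CONFIG_ENV and the function names are
# illustrative and do not come from the original report.
CONFIG_ENV=${CONFIG_ENV:-./config.env}

# Step 2: load the environment overrides, then start the node.
start_node() {
    . "$CONFIG_ENV" || return 1
    emqx start
}

# Step 3: show which nodes this node currently sees in the cluster.
show_cluster() {
    emqx ctl cluster status
}

# Steps 4 and 5: stop the node, then start it again with the same config.
restart_node() {
    emqx stop && start_node
}
```

Calling start_node and then show_cluster on each machine covers steps 2 and 3; restart_node corresponds to steps 4 and 5 on a single machine.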

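Expected behavior

The nodes restart successfully and the cluster re-forms.

Actual behavior

The node fails to start and reports the following errors: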
ERROR: EMQX 5.7.0 using node name 'emqx@192.168.78.45' failed 120 probes.
ERROR: emqx@192.168.78.45 node is started, but failed to complete the boot sequence in time.
Shutting down emqx@192.168.78.45 from to_erl pipe.
Attaching to //root/emqx/emqx-data/root_erl_pipes/emqx@192.168.78.45/erlang.pipe.1 (^D to exit)

[EOF]

There is one rule: the node that was stopped first has to be started first. The order matters.

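That is exactly what I did: the node that was stopped first was started first, so start order is not the problem. The log file contains the following warnings: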
2025-06-27T22:27:22.977970+08:00 [warning] mria_mnesia: still waiting for table(s): ['$mria_rlog_sync']
2025-06-27T22:27:23.112598+08:00 [warning] Check down_nodes should get but got ['emqx@192.168.78.36', 'emqx@192.9.200.35', 'emqx@192.9.200.34', 'emqx@192.9.200.31', 'emqx@192.168.78.44'], Check check_open_ports should get ok but got #{msg =>, "some ports are unreachable", results =>, #{'emqx@192.168.78.36' =>, #{status => bad_ports, resolved_ips =>, [{192,168,78, 36}], ports_to_check =>, [5370], open_ports =>, #{5370 =>, false}}, 'emqx@192.168.78.44' =>, #{status => bad_ports, resolved_ips =>, [{192,168,78, 44}], ports_to_check =>, [5370], open_ports =>, #{5370 =>, false}}, 'emqx@192.9.200.31' =>, #{status => bad_ports, resolved_ips =>, [{192,9,200, 31}], ports_to_check =>, [5370], open_ports =>, #{5370 =>, false}}, 'emqx@192.9.200.34' =>, #{status => bad_ports, resolved_ips =>, [{192,9,200, 34}], ports_to_check =>, [5370], open_ports =>, #{5370 =>, false}}, 'emqx@192.9.200.35' =>, #{status => bad_ports, resolved_ips =>, [{192,9,200, 35}], ports_to_check =>, [5370], open_ports =>, #{5370 =>, false}}}}, Table '$mria_rlog_sync' is waiting for one of the nodes: ['emqx@192.168.78.36', 'emqx@192.168.78.44', 'emqx@192.9.200.34', 'emqx@192.9.200.35', 'emqx@192.9.200.31']

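Judging by the warnings, the failure is caused by the node being unable to reach the other nodes. But at that point every node has already been stopped, so isn't being unable to reach them the normal situation?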
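
To tell "the peer is simply not running yet" apart from a firewall or routing problem, the same port that the check_open_ports warning reports (5370) can be probed by hand from the starting node. The sketch below is an assumption-laden helper, not an EMQX command: host_of, port_open and probe_nodes are hypothetical names, and it relies on bash and the coreutils timeout command being installed.

```shell
#!/bin/sh
# Hypothetical helpers (not part of EMQX) for probing the port named in
# the check_open_ports warning.

# Strip everything up to and including "@" from a node name,
# e.g. emqx@192.168.78.36 becomes 192.168.78.36.
host_of() {
    printf '%s\n' "${1#*@}"
}

# Return 0 if a TCP connection to host $1, port $2 succeeds within
# 2 seconds. Uses bash's /dev/tcp pseudo-device in a bash subprocess.
port_open() {
    timeout 2 bash -c "exec 3<>/dev/tcp/$1/$2" 2>/dev/null
}

# Usage: probe_nodes PORT NODE...
# Prints one status line per node name.
probe_nodes() {
    port=$1
    shift
    for node in "$@"; do
        host=$(host_of "$node")
        if port_open "$host" "$port"; then
            echo "$node $port open"
        else
            echo "$node $port unreachable"
        fi
    done
}
```

For example, probe_nodes 5370 emqx@192.168.78.36 emqx@192.168.78.44 can be run while the peers are up. If a port is reported unreachable even though the peer node is running, the network path is the more likely culprit than the start order.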

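I found that once all nodes in the cluster have been stopped, a node that is started again tries to connect to port 5370 of every other offline node, and emqx start only returns success after all of those nodes have been reached. Does this mean I can never stop the whole cluster? Is this design reasonable, and what is its purpose?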

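This log does point to a start-order problem: the node is waiting for the other nodes to come up. Only after it has fetched the data from them does it move on to the next step.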