When one cluster node goes offline, the other nodes cannot accept new clients

Nodes A and B both run EMQX 5.3.0.
Clustering method: manual clustering, with node B joining node A.

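For roughly 3 minutes after node B's network is disconnected, the following happens:
1. The dashboards of both node A and node B are inaccessible.
2. Node A, which is still online, cannot accept new clients. Tests with both paho and MQTTX fail to connect to node A (old clients that were already connected to node A before the disconnection keep running and can send and receive messages normally).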

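Is this behavior expected? Do clients really have to wait 3 minutes before they can connect? Is there any way to avoid clients being unable to connect during this window?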

In theory, one node going offline should not affect the working state of the whole cluster. If it does, it is usually a bug.

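Could you help provide more information on the following points:

  1. Why the client cannot connect. a) Does the client report any error? b) Capture packets to see the specific cause of the connection failure.
  2. Check whether node A produces any error logs during this process.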
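
If a full packet capture is inconvenient, a small probe can at least show at which stage the connection stalls. The following is only a rough sketch, not an official EMQX tool: it assumes node A listens for plain MQTT on the default port 1883 without authentication, and the names `build_connect_packet` and `probe` are made up for this example. It opens a TCP connection, sends a minimal MQTT 3.1.1 CONNECT packet, and reports whether the TCP handshake failed, the broker stayed silent, or a CONNACK came back.

```python
import socket
import struct


def build_connect_packet(client_id: str, keepalive: int = 60) -> bytes:
    """Build a minimal MQTT 3.1.1 CONNECT packet (clean session, no auth)."""
    variable_header = (
        struct.pack("!H", 4) + b"MQTT"   # length-prefixed protocol name
        + bytes([4])                      # protocol level 4 = MQTT 3.1.1
        + bytes([0x02])                   # connect flags: clean session only
        + struct.pack("!H", keepalive)    # keep-alive in seconds
    )
    cid = client_id.encode("utf-8")
    payload = struct.pack("!H", len(cid)) + cid
    remaining = len(variable_header) + len(payload)
    if remaining > 127:
        # This sketch only encodes a single-byte "remaining length".
        raise ValueError("client_id too long for this sketch")
    return bytes([0x10, remaining]) + variable_header + payload


def probe(host: str, port: int = 1883, timeout: float = 5.0) -> str:
    """Return a short string describing where the connection attempt stopped."""
    try:
        sock = socket.create_connection((host, port), timeout=timeout)
    except OSError as exc:
        return f"tcp_connect_failed: {exc}"
    with sock:
        sock.settimeout(timeout)
        try:
            sock.sendall(build_connect_packet("probe-client"))
            data = sock.recv(4)  # a CONNACK is exactly 4 bytes in MQTT 3.1.1
        except socket.timeout:
            return "tcp_ok_but_no_connack_within_timeout"
        except OSError as exc:
            return f"socket_error: {exc}"
    if not data:
        return "connection_closed_by_broker"
    if len(data) == 4 and data[0] == 0x20:
        return f"connack_return_code={data[3]}"
    return f"unexpected_reply: {data.hex()}"
```

Running `probe("192.168.1.136")` once while node B is disconnected and once after the cluster has recovered should make the difference visible. A result of `tcp_ok_but_no_connack_within_timeout` would suggest the listener accepts the socket but the broker never answers the CONNECT, which is useful to mention alongside the logs.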

1. I connected with MQTTX and the client did not show any error message. The packet capture is as follows:
136 is node A, 182 is the client.

2. No error logs were produced during this process.

Debug logs may be needed. Set the log level with: emqx ctl log set-level debug

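Hello, I may have made a mistake earlier that kept the log file from being generated properly. After I restarted EMQX, the log file was generated normally.
Node A is 136, node B is 181.
After restarting the nodes (node A first, then node B), I disconnected node B's network twice.
The log on node A is as follows: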
2023-11-07T14:12:42.859400+08:00 [error] msg: failed_to_sync_cluster_conf, mfa: emqx_conf_app:sync_cluster_conf2/1(160), failed: ['emqx@192.168.1.181'], nodes: ['emqx@192.168.1.181'], not_ready:
2023-11-07T14:13:11.407209+08:00 [warning] msg: dashboard_monitor_error, mfa: emqx_dashboard_monitor:current_rate/1(144), reason: {noproc,{gen_server,call,[emqx_dashboard_monitor,current_rate,5000]}}
2023-11-07T14:16:12.439752+08:00 [error] ** Node 'emqx@192.168.1.181' not responding **, ** Removing (timedout) connection **
2023-11-07T14:16:42.569945+08:00 [error] Mnesia('emqx@192.168.1.136'): ** ERROR ** mnesia_event got {inconsistent_database, running_partitioned_network, 'emqx@192.168.1.181'}
2023-11-07T14:16:42.570070+08:00 [critical] msg: Core cluster partition, mfa: mria_node_monitor:handle_info/2(160), context: running_partitioned_network, from: 'emqx@192.168.1.181'
2023-11-07T14:16:42.570370+08:00 [warning] msg: alarm_is_activated, mfa: emqx_alarm:do_actions/3(418), message: <<"Partition occurs at node emqx@192.168.1.181">>, name: partition
2023-11-07T14:17:14.097685+08:00 [warning] msg: Stopping mria, mfa: mria:stop/1(134), reason: heal
2023-11-07T14:17:02.017415+08:00 [warning] msg: alarm_is_deactivated, mfa: emqx_alarm:do_actions/3(424), name: partition
2023-11-07T14:17:02.233584+08:00 [critical] msg: Rejoin for autoheal, mfa: mria_autoheal:rejoin/1(154), node: 'emqx@192.168.1.181', return: ok
2023-11-07T14:17:16.572894+08:00 [warning] msg: dashboard_monitor_error, mfa: emqx_dashboard_monitor:current_rate/1(144), reason: {noproc,{gen_server,call,[emqx_dashboard_monitor,current_rate,5000]}}
2023-11-07T14:20:12.447761+08:00 [error] ** Node 'emqx@192.168.1.181' not responding **, ** Removing (timedout) connection **
2023-11-07T14:20:55.237330+08:00 [error] Mnesia('emqx@192.168.1.136'): ** ERROR ** mnesia_event got {inconsistent_database, running_partitioned_network, 'emqx@192.168.1.181'}
2023-11-07T14:20:55.237349+08:00 [critical] msg: Core cluster partition, mfa: mria_node_monitor:handle_info/2(160), context: running_partitioned_network, from: 'emqx@192.168.1.181'
2023-11-07T14:20:55.237717+08:00 [warning] msg: alarm_is_activated, mfa: emqx_alarm:do_actions/3(418), message: <<"Partition occurs at node emqx@192.168.1.181">>, name: partition
2023-11-07T14:21:26.756870+08:00 [warning] msg: Stopping mria, mfa: mria:stop/1(134), reason: heal
2023-11-07T14:21:14.733077+08:00 [warning] msg: alarm_is_deactivated, mfa: emqx_alarm:do_actions/3(424), name: partition
2023-11-07T14:21:14.991545+08:00 [critical] msg: Rejoin for autoheal, mfa: mria_autoheal:rejoin/1(154), node: 'emqx@192.168.1.181', return: ok
2023-11-07T14:21:30.096462+08:00 [warning] msg: dashboard_monitor_error, mfa: emqx_dashboard_monitor:current_rate/1(144), reason: {noproc,{gen_server,call,[emqx_dashboard_monitor,current_rate,5000]}}
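
To compare these timestamps with the roughly 3 minute window, the intervals between the key events can be computed directly from the log. This is just a small sketch using the first disconnection: the three sample strings are the leading part of the corresponding log lines above (truncated after the message name), and the helper names `parse_ts` and `seconds_between` are made up for this example. It only measures gaps between logged events; it cannot tell when the network was actually unplugged, since that moment is not in the log.

```python
from datetime import datetime


def parse_ts(line: str) -> datetime:
    """Parse the ISO 8601 timestamp at the start of an EMQX log line."""
    return datetime.fromisoformat(line.split(" ", 1)[0])


def seconds_between(first: str, second: str) -> float:
    """Seconds elapsed from the first log line to the second."""
    return (parse_ts(second) - parse_ts(first)).total_seconds()


# Leading parts of three lines from the first disconnection (truncated).
not_responding = "2023-11-07T14:16:12.439752+08:00 [error] ** Node"
partition = "2023-11-07T14:16:42.570070+08:00 [critical] msg: Core cluster partition"
rejoin = "2023-11-07T14:17:02.233584+08:00 [critical] msg: Rejoin for autoheal"

# Gap from "not responding" to the partition being reported.
detect_gap = seconds_between(not_responding, partition)
# Gap from the partition being reported to the autoheal rejoin.
heal_gap = seconds_between(partition, rejoin)

print(f"not responding -> partition: {detect_gap:.1f}s")
print(f"partition -> rejoin: {heal_gap:.1f}s")
```

Noting the wall-clock time of each network disconnect and reconnect next to these values would make it easier to see which part of the 3 minutes passes before node A logs anything at all.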