The servers were shut down; when nodes A, B, and C restarted at boot, node C logged message=channel_closed driver=tcp socket="#Port<0.16>" action=stopping. What causes this error?

5779170 · July 1, 2024, 11:03

Node A: [warning] mria_mnesia: still waiting for table(s): ['$mria_rlog_sync']
Node C: [error] message=channel_closed driver=tcp socket="#Port<0.16>" action=stopping
Node B: [warning] mria_mnesia: still waiting for table(s): ['$mria_rlog_sync']

All three nodes start at boot, and A and B keep waiting for node C to sync data. The cluster only recovers after I restart node C. What causes the error on node C?

zhongwencool · July 2, 2024, 01:11

It is hard to tell from these three log lines alone. Could you upload the full logs from A, B, and C?

5779170 · July 2, 2024, 01:26

Then what does this line mean? message=channel_closed driver=tcp socket="#Port<0.16>" action=stopping

zhongwencool · July 2, 2024, 01:30

The TCP connection on #Port<0.16> was closed, so it could not connect.

5779170 · July 2, 2024, 01:42

Which port is this?

zhongwencool · July 2, 2024, 03:34

The logs do not contain any more information, so I can't tell either.

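5779170 · July 3, 2024, 01:06

logs.zip (371.5 KB)
The logs are attached above. The problem began with the first entries on July 1. Below is the first log line from each node on that day:
Node 0.0.0.106: 2024-07-01T06:26:42.893925+08:00 [warning] mria_mnesia: still waiting for table(s): ['$mria_rlog_sync']
Node 0.0.0.73: 2024-07-01T06:27:12.174773+08:00 [error] message=channel_closed driver=tcp socket="#Port<0.16>" action=stopping
Node 0.0.0.105: 2024-07-01T06:21:43.236490+08:00 [warning] mria_mnesia: still waiting for table(s): ['$mria_rlog_sync']

The three nodes had been shut down the previous evening and failed to start when powered back on. From the above, node 73 appears to be the primary node, and the main problem is that it could not finish starting. It only recovered after I ran ./emqx stop on node 73 at 2024-07-01 06:59:22 and then started it again. I would like to understand why it could not start at that time.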

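zhongwencool · July 3, 2024, 05:33

From the logs:

106 started at 2024-07-01T06:26:42.893925+08:00 and 105 started at 2024-07-01T06:21:43.236490+08:00. On startup, each found that its mnesia data had to be synced from 73, so both kept waiting for 73 to start.
73 started at 2024-07-01T06:27:20.947164+08:00 and found that 105 and 106 were already in the cluster, so it tried to sync the configuration from one of them before continuing to boot. But when 73 queried 106 and 105, neither was ready, so 73 in turn waited for them.
The two sides ended up waiting for each other, so none of the nodes could finish starting.
A temporary workaround: stop 105 and 106 first, then start 73; once 73 is running normally, start 105 and 106.

We will follow up on this bug and see how to fix it.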
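The mutual wait described above can be pictured as a cycle in a wait-for graph. The following is only an illustrative Python sketch, not EMQX code: the node names come from this thread, while `WAITS_FOR` and `find_wait_cycle` are hypothetical names made up for the example, and the graph is a simplified reading of the log analysis.

```python
# Illustrative model only; not EMQX code.
# Each node maps to the set of nodes it waits on before it can finish booting.
WAITS_FOR = {
    "105": {"73"},         # waiting to sync mnesia data from 73
    "106": {"73"},         # waiting to sync mnesia data from 73
    "73": {"105", "106"},  # waiting to fetch config from a ready peer
}


def find_wait_cycle(waits_for):
    """Return one cycle in the wait-for graph as a list of nodes, or None."""
    visiting, done = [], set()

    def visit(node):
        if node in visiting:
            # Back edge found: slice out the cycle and close it.
            return visiting[visiting.index(node):] + [node]
        if node in done:
            return None
        visiting.append(node)
        for peer in sorted(waits_for.get(node, ())):
            cycle = visit(peer)
            if cycle:
                return cycle
        visiting.pop()
        done.add(node)
        return None

    for start in sorted(waits_for):
        cycle = visit(start)
        if cycle:
            return cycle
    return None


if __name__ == "__main__":
    # All three nodes booting together: a cycle exists.
    print(find_wait_cycle(WAITS_FOR))
    # Simplified view of the workaround: with 105 and 106 stopped,
    # 73 has no peer to wait on, so there is no cycle.
    print(find_wait_cycle({"73": set()}))
```

In this simplified model, stopping 105 and 106 removes every edge that points back at 73, which is why starting 73 alone first breaks the cycle.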

5779170 · July 3, 2024, 05:47

OK, thank you.

5779170 · July 3, 2024, 05:56

But I have one question: why is this the first log line that node 73 printed at startup? Node 0.0.0.73: 2024-07-01T06:27:12.174773+08:00 [error] message=channel_closed driver=tcp socket="#Port<0.16>" action=stopping

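zhongwencool · July 3, 2024, 07:46

This log line means the node tried to connect to gen_rpc on the other machines to transfer mnesia's internal data, but the other machines refused it.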

github.com

emqx/gen_rpc/blob/master/src/gen_rpc_client.erl#L372-L374


      
    handle_info({DriverClosed, Socket}, #state{socket=Socket, driver=Driver, driver_closed=DriverClosed} = State) ->
        ?log(error, "message=channel_closed driver=~s socket=\"~s\" action=stopping", [Driver, gen_rpc_helper:socket_to_string(Socket)]),
        {stop, normal, State};
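The format string in that clause produces space-separated key=value pairs, with the socket value in double quotes. As a small aid for reading such lines, here is a hedged Python sketch that splits one into fields; `parse_kv_line` is a hypothetical helper written for this example and is not part of gen_rpc or EMQX.

```python
import shlex


def parse_kv_line(line):
    """Split a 'key=value key="quoted value"' log message into a dict."""
    fields = {}
    # shlex.split honours the double quotes and strips them from the value.
    for token in shlex.split(line):
        key, sep, value = token.partition("=")
        if sep:  # ignore tokens that carry no '='
            fields[key] = value
    return fields


if __name__ == "__main__":
    msg = 'message=channel_closed driver=tcp socket="#Port<0.16>" action=stopping'
    print(parse_kv_line(msg))
```

Here `driver` is the transport (`tcp`), `socket` is the string form of the socket as printed by the Erlang code above, and `action=stopping` matches the `{stop, normal, State}` return of the client process.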

zhongwencool · July 3, 2024, 09:19

Which exact version of EMQX are you using?

5779170 · July 3, 2024, 10:06

Version 5.2.0.

zhongwencool · July 3, 2024, 11:03

github.com/emqx/emqx

fix(cluster-rpc): boot from local config if table loaded

committed 02:06PM - 08 Nov 23 UTC

zmstone

+128 -92

When EMQX boots up, it tries to get latest config from peer (core type) nodes, if none of the nodes are replying, the node will decide to boot with local config (and replay the committed changes) if the commit table is loaded from disk locally (an indication of the data being latest), otherwise it will sleep for 1-2 seconds and retry.

This lead to a race condition, e.g. in a two nodes cluster:

1. node1 boots up
2. node2 boots up and copy mnesia table from node1
3. node1 restart before node2 can sync cluster.hocon from it
4. node1 boots up and copy mnesia table from node2

Now that both node1 and node2 has the mnesia `load_node` pointing to each other (i.e. not a local disk load). Prior to this fix, the nodes would wait for each other in a dead loop.

This commit fixes the issue by allowing node to boot with local config if it does not have a lagging.
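To make the commit message easier to follow, here is a rough Python sketch of the boot decision before and after the fix. It is a reading of the commit message only, not the actual EMQX implementation: `BootState`, `decide_before_fix`, `decide_after_fix`, and the `lagging` flag are hypothetical names, and the exact "not lagging" condition is an assumption.

```python
from dataclasses import dataclass


@dataclass
class BootState:
    peer_replied: bool            # some core peer answered with its config
    table_loaded_from_disk: bool  # commit table was loaded from local disk
    lagging: bool                 # assumption: local commits are behind


def decide_before_fix(s: BootState) -> str:
    """Boot rule as described for the behaviour prior to the fix."""
    if s.peer_replied:
        return "copy_from_peer"
    if s.table_loaded_from_disk:
        return "boot_local"
    return "retry"  # sleep 1-2 seconds and ask the peers again


def decide_after_fix(s: BootState) -> str:
    """Assumed rule after the fix: also boot locally when not lagging."""
    if s.peer_replied:
        return "copy_from_peer"
    if s.table_loaded_from_disk or not s.lagging:
        return "boot_local"
    return "retry"


if __name__ == "__main__":
    # The stuck case: no peer replies, and the table was copied from a
    # peer rather than loaded from local disk.
    stuck = BootState(peer_replied=False, table_loaded_from_disk=False, lagging=False)
    print(decide_before_fix(stuck))
    print(decide_after_fix(stuck))
```

Under these assumptions, the old rule keeps returning `retry` on every node whose table was copied from a peer, which is the dead loop the commit describes, while the new rule lets such a node fall back to its local config.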

We have already fixed this issue in the latest version, and we recommend upgrading to it.