EMQX 5.4.0 cluster: a node cannot be brought back up after it goes down

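I set up a two-node cluster in a test environment. One of the nodes went down because of a server problem, and a few days later it could no longer be restarted: every attempt failed. It only started again after I changed the cluster name and the node name.
Error: Kernel pid terminated
The log shows this error: Failed to merge schema: {aborted,function_clause}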

Details:

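In production, a node that cannot be brought back up after a crash, so that the cluster has to be rebuilt, would be a risk. I would like to understand the cause and find a solution.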

Hello, a couple of questions:

  1. What cluster configuration are you using?
./bin/emqx ctl conf show cluster
  2. Could you provide a complete log file?

1. Cluster configuration: we use manual, i.e. nodes are joined to the cluster manually.

2. I cannot download the log file, but I can post its contents:

2024-01-10T15:44:25.309559+08:00 [error] Mnesia(‘emqxcl@10.2.4.36’): ** ERROR ** (core dumped to file: “/usr/soft/mqtt/emqx/MnesiaCore.emqxcl@10.2.4.36_1704_872665_308638”), ** FATAL ** Failed to merge schema: {aborted,function_clause}
2024-01-10T15:44:35.310250+08:00 [error] Generic server mnesia_recover terminating. Reason: killed. Last message: {‘EXIT’,<0.2189.0>,killed}. State: {state,<0.2189.0>,undefined,undefined,undefined,0,false,true,}.
2024-01-10T15:44:35.311058+08:00 [error] Generic server mnesia_subscr terminating. Reason: killed. Last message: {‘EXIT’,<0.2189.0>,killed}. State: {state,<0.2189.0>,#Ref<0.1213154094.3492413455.243504>}.
2024-01-10T15:44:35.311288+08:00 [error] crasher: initial call: mnesia_subscr:init/1, pid: <0.2191.0>, registered_name: mnesia_subscr, exit: {killed,[{gen_server,decode_msg,9,[{file,“gen_server.erl”},{line,909}]},{proc_lib,init_p_do_apply,3,[{file,“proc_lib.erl”},{line,240}]}]}, ancestors: [mnesia_kernel_sup,mnesia_sup,<0.2185.0>], message_queue_len: 0, messages: , links: , dictionary: , trap_exit: true, status: running, heap_size: 2586, stack_size: 28, reductions: 2429; neighbours:
2024-01-10T15:44:35.311583+08:00 [error] Generic server mnesia_monitor terminating. Reason: killed. Last message: {‘EXIT’,<0.2189.0>,killed}. State: {state,<0.2189.0>,,,true,,undefined,,}.
2024-01-10T15:44:35.310688+08:00 [error] crasher: initial call: mnesia_recover:init/1, pid: <0.2193.0>, registered_name: mnesia_recover, exit: {killed,[{gen_server,decode_msg,9,[{file,“gen_server.erl”},{line,909}]},{proc_lib,init_p_do_apply,3,[{file,“proc_lib.erl”},{line,240}]}]}, ancestors: [mnesia_kernel_sup,mnesia_sup,<0.2185.0>], message_queue_len: 0, messages: , links: , dictionary: , trap_exit: true, status: running, heap_size: 4185, stack_size: 28, reductions: 5236; neighbours:
2024-01-10T15:44:35.311388+08:00 [error] crasher: initial call: application_master:init/4, pid: <0.2184.0>, registered_name: , exit: {{normal,{mnesia_app,start,[normal,]}},[{application_master,init,4,[{file,“application_master.erl”},{line,142}]},{proc_lib,init_p_do_apply,3,[{file,“proc_lib.erl”},{line,240}]}]}, ancestors: [<0.2183.0>], message_queue_len: 1, messages: [{‘EXIT’,<0.2185.0>,normal}], links: [<0.2183.0>,<0.1998.0>], dictionary: , trap_exit: true, status: running, heap_size: 376, stack_size: 28, reductions: 195; neighbours:
2024-01-10T15:44:35.311015+08:00 [error] crasher: initial call: gen_event:init_it/6, pid: <0.2187.0>, registered_name: mnesia_event, exit: {killed,[{gen_event,terminate_server,4,[{file,“gen_event.erl”},{line,580}]},{proc_lib,init_p_do_apply,3,[{file,“proc_lib.erl”},{line,240}]}]}, ancestors: [mnesia_sup,<0.2185.0>], message_queue_len: 1, messages: [{notify,{mnesia_system_event,{mnesia_down,‘emqxcl@10.2.4.36’}}}], links: , dictionary: , trap_exit: true, status: running, heap_size: 4185, stack_size: 28, reductions: 3694; neighbours:
2024-01-10T15:44:35.312518+08:00 [error] crasher: initial call: mnesia_monitor:init/1, pid: <0.2190.0>, registered_name: mnesia_monitor, exit: {killed,[{gen_server,decode_msg,9,[{file,“gen_server.erl”},{line,909}]},{proc_lib,init_p_do_apply,3,[{file,“proc_lib.erl”},{line,240}]}]}, ancestors: [mnesia_kernel_sup,mnesia_sup,<0.2185.0>], message_queue_len: 0, messages: , links: [<0.2221.0>,<60465.2201.0>], dictionary: , trap_exit: true, status: running, heap_size: 4185, stack_size: 28, reductions: 7447; neighbours:
2024-01-10T15:44:35.312539+08:00 [error] crasher: initial call: application_master:init/4, pid: <0.2176.0>, registered_name: , exit: {{bad_return,{{mria_app,start,[normal,]},{‘EXIT’,{{badmatch,{error,{normal,{mnesia_app,start,[normal,]}}}},[{mria_mnesia,ensure_started,0,[{file,“mria_mnesia.erl”},{line,112}]},{mria_app,start,2,[{file,“mria_app.erl”},{line,36}]},{application_master,start_it_old,4,[{file,“application_master.erl”},{line,293}]}]}}}},[{application_master,init,4,[{file,“application_master.erl”},{line,142}]},{proc_lib,init_p_do_apply,3,[{file,“proc_lib.erl”},{line,240}]}]}, ancestors: [<0.2175.0>], message_queue_len: 1, messages: [{‘EXIT’,<0.2177.0>,normal}], links: [<0.2175.0>,<0.1998.0>], dictionary: , trap_exit: true, status: running, heap_size: 376, stack_size: 28, reductions: 195; neighbours:
2024-01-10T15:44:35.312970+08:00 [error] crasher: initial call: application_master:init/4, pid: <0.2173.0>, registered_name: , exit: {{bad_return,{{emqx_machine_app,start,[normal,]},{‘EXIT’,{{badmatch,{error,{mria,{bad_return,{{mria_app,start,[normal,]},{‘EXIT’,{{badmatch,{error,{normal,{mnesia_app,start,[normal,]}}}},[{mria_mnesia,ensure_started,0,[{file,“mria_mnesia.erl”},{line,112}]},{mria_app,start,2,[{file,“mria_app.erl”},{line,36}]},{application_master,start_it_old,4,[{file,“application_master.erl”},{line,293}]}]}}}}}}},[{mria,start,0,[{file,“mria.erl”},{line,125}]},{ekka,start,0,[{file,“ekka.erl”},{line,94}]},{emqx_machine,start,0,[{file,“emqx_machine.erl”},{line,54}]},{emqx_machine_app,start,2,[{file,“emqx_machine_app.erl”},{line,29}]},{application_master,start_it_old,4,[{file,“application_master.erl”},{line,293}]}]}}}},[{application_master,init,4,[{file,“application_master.erl”},{line,142}]},{proc_lib,init_p_do_apply,3,[{file,“proc_lib.erl”},{line,240}]}]}, ancestors: [<0.2172.0>], message_queue_len: 1, messages: [{‘EXIT’,<0.2174.0>,normal}], links: [<0.2172.0>,<0.1998.0>], dictionary: , trap_exit: true, status: running, heap_size: 987, stack_size: 28, reductions: 221; neighbours:

The output is as follows:

cluster {
autoclean = 24h
autoheal = true
discovery_strategy = manual
name = emqxclu
proto_dist = inet_tcp
}

Thank you very much!
We will analyze this first; nothing obvious has turned up yet.

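Also, a few more questions:
Was the node that went down a core node or a replicant? Did it go down because the machine failed and rebooted, or because of a software problem?
How many nodes are there in total?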

./bin/emqx ctl cluster status

./bin/emqx ctl conf show node

There are two node machines in total. Both are core nodes and no replicant is used. The node went down because the machine failed and rebooted.

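You can try to reproduce it: in a two-node cluster, one node went down while the cluster kept serving traffic. Our test scenario was a client continuously publishing messages in the background, so we left it unattended. Later we found that one node had been down for several days, and every attempt to bring it back up failed.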

OK, thank you.

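Is there any way to solve this? I have the same problem. I deployed a 3-node cluster with the helm chart, and the problem appeared after a physical host rebooted.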

[root@cloudbase-1 ~]# kubectl get pods -n emqx-cluster
NAME READY STATUS RESTARTS AGE
emqx-clsuter-0 1/1 Running 0 20h
emqx-clsuter-1 0/1 CrashLoopBackOff 186 (4m48s ago) 16h
emqx-clsuter-2 1/1 Running 0 184d

emqx@emqx-clsuter-0:/opt/emqx$ ./bin/emqx ctl conf show cluster
cluster {
autoclean = 24h
autoheal = true
discovery_strategy = dns
dns {name = emqx-clsuter-headless.emqx-cluster.svc.cluster.local, record_type = srv}
name = emqxcl
proto_dist = inet_tcp
}

WARNING: Default (insecure) Erlang cookie is in use.
WARNING: Configure node.cookie in /opt/emqx/etc/emqx.conf or override from environment variable EMQX_NODE__COOKIE
WARNING: NOTE: Use the same cookie for all nodes in the cluster.
EMQX_DASHBOARD__DEFAULT_PASSWORD [dashboard.default_password]: ******
EMQX_DASHBOARD__DEFAULT_USERNAME [dashboard.default_username]: admin
EMQX_RPC__PORT_DISCOVERY [rpc.port_discovery]: manual
EMQX_CLUSTER__DNS__RECORD_TYPE [cluster.dns.record_type]: srv
EMQX_CLUSTER__DNS__NAME [cluster.dns.name]: emqx-clsuter-headless.emqx-cluster.svc.cluster.local
EMQX_CLUSTER__DISCOVERY_STRATEGY [cluster.discovery_strategy]: dns
EMQX_NODE__NAME [node.name]: emqx-clsuter@emqx-clsuter-1.emqx-clsuter-headless.emqx-cluster.svc.cluster.local
2024-01-26T01:31:34.127595+00:00 [error] Mnesia(‘emqx-clsuter@emqx-clsuter-1.emqx-clsuter-headless.emqx-cluster.svc.cluster.local’): ** ERROR ** (core dumped to file: “/opt/emqx/MnesiaCore.emqx-clsuter@emqx-clsuter-1.emqx-clsuter-headless.emqx-cluster.svc.cluster.local_1706_232694_127173”), ** FATAL ** Failed to merge schema: {aborted,function_clause}
2024-01-26T01:31:44.128183+00:00 [error] crasher: initial call: application_master:init/4, pid: <0.1963.0>, registered_name: , exit: {{normal,{mnesia_app,start,[normal,]}},[{application_master,init,4,[{file,“application_master.erl”},{line,142}]},{proc_lib,init_p_do_apply,3,[{file,“proc_lib.erl”},{line,240}]}]}, ancestors: [<0.1962.0>], message_queue_len: 1, messages: [{‘EXIT’,<0.1964.0>,normal}], links: [<0.1962.0>,<0.1801.0>], dictionary: , trap_exit: true, status: running, heap_size: 376, stack_size: 28, reductions: 170; neighbours:
2024-01-26T01:31:44.128519+00:00 [error] Generic server mnesia_subscr terminating. Reason: killed. Last message: {‘EXIT’,<0.1968.0>,killed}. State: {state,<0.1968.0>,#Ref<0.3521175414.1904869377.219840>}.
2024-01-26T01:31:44.128657+00:00 [error] crasher: initial call: mnesia_subscr:init/1, pid: <0.1970.0>, registered_name: mnesia_subscr, exit: {killed,[{gen_server,decode_msg,9,[{file,“gen_server.erl”},{line,909}]},{proc_lib,init_p_do_apply,3,[{file,“proc_lib.erl”},{line,240}]}]}, ancestors: [mnesia_kernel_sup,mnesia_sup,<0.1964.0>], message_queue_len: 0, messages: , links: , dictionary: , trap_exit: true, status: running, heap_size: 1598, stack_size: 28, reductions: 2398; neighbours:
2024-01-26T01:31:44.128446+00:00 [error] Generic server mnesia_recover terminating. Reason: killed. Last message: {‘EXIT’,<0.1968.0>,killed}. State: {state,<0.1968.0>,undefined,undefined,undefined,0,false,true,}.
2024-01-26T01:31:44.128537+00:00 [error] Generic server mnesia_monitor terminating. Reason: killed. Last message: {‘EXIT’,<0.1968.0>,killed}. State: {state,<0.1968.0>,true,undefined,}.
2024-01-26T01:31:44.128821+00:00 [error] crasher: initial call: mnesia_recover:init/1, pid: <0.1972.0>, registered_name: mnesia_recover, exit: {killed,[{gen_server,decode_msg,9,[{file,“gen_server.erl”},{line,909}]},{proc_lib,init_p_do_apply,3,[{file,“proc_lib.erl”},{line,240}]}]}, ancestors: [mnesia_kernel_sup,mnesia_sup,<0.1964.0>], message_queue_len: 0, messages: , links: , dictionary: , trap_exit: true, status: running, heap_size: 1598, stack_size: 28, reductions: 6718; neighbours:
2024-01-26T01:31:44.128627+00:00 [error] crasher: initial call: gen_event:init_it/6, pid: <0.1966.0>, registered_name: mnesia_event, exit: {killed,[{gen_event,terminate_server,4,[{file,“gen_event.erl”},{line,580}]},{proc_lib,init_p_do_apply,3,[{file,“proc_lib.erl”},{line,240}]}]}, ancestors: [mnesia_sup,<0.1964.0>], message_queue_len: 1, messages: [{notify,{mnesia_system_event,{mnesia_down,‘emqx-clsuter@emqx-clsuter-1.emqx-clsuter-headless.emqx-cluster.svc.cluster.local’}}}], links: , dictionary: , trap_exit: true, status: running, heap_size: 4185, stack_size: 28, reductions: 4574; neighbours:
2024-01-26T01:31:44.128652+00:00 [error] crasher: initial call: application_master:init/4, pid: <0.1955.0>, registered_name: , exit: {{bad_return,{{mria_app,start,[normal,]},{‘EXIT’,{{badmatch,{error,{normal,{mnesia_app,start,[normal,]}}}},[{mria_mnesia,ensure_started,0,[{file,“mria_mnesia.erl”},{line,112}]},{mria_app,start,2,[{file,“mria_app.erl”},{line,36}]},{application_master,start_it_old,4,[{file,“application_master.erl”},{line,293}]}]}}}},[{application_master,init,4,[{file,“application_master.erl”},{line,142}]},{proc_lib,init_p_do_apply,3,[{file,“proc_lib.erl”},{line,240}]}]}, ancestors: [<0.1954.0>], message_queue_len: 1, messages: [{‘EXIT’,<0.1956.0>,normal}], links: [<0.1954.0>,<0.1801.0>], dictionary: , trap_exit: true, status: running, heap_size: 376, stack_size: 28, reductions: 167; neighbours:
2024-01-26T01:31:44.129103+00:00 [error] crasher: initial call: application_master:init/4, pid: <0.1952.0>, registered_name: , exit: {{bad_return,{{emqx_machine_app,start,[normal,]},{‘EXIT’,{{badmatch,{error,{mria,{bad_return,{{mria_app,start,[normal,]},{‘EXIT’,{{badmatch,{error,{normal,{mnesia_app,start,[normal,]}}}},[{mria_mnesia,ensure_started,0,[{file,“mria_mnesia.erl”},{line,112}]},{mria_app,start,2,[{file,“mria_app.erl”},{line,36}]},{application_master,start_it_old,4,[{file,“application_master.erl”},{line,293}]}]}}}}}}},[{mria,start,0,[{file,“mria.erl”},{line,124}]},{ekka,start,0,[{file,“ekka.erl”},{line,94}]},{emqx_machine,start,0,[{file,“emqx_machine.erl”},{line,45}]},{emqx_machine_app,start,2,[{file,“emqx_machine_app.erl”},{line,27}]},{application_master,start_it_old,4,[{file,“application_master.erl”},{line,293}]}]}}}},[{application_master,init,4,[{file,“application_master.erl”},{line,142}]},{proc_lib,init_p_do_apply,3,[{file,“proc_lib.erl”},{line,240}]}]}, ancestors: [<0.1951.0>], message_queue_len: 1, messages: [{‘EXIT’,<0.1953.0>,normal}], links: [<0.1951.0>,<0.1801.0>], dictionary: , trap_exit: true, status: running, heap_size: 610, stack_size: 28, reductions: 188; neighbours:
2024-01-26T01:31:44.129611+00:00 [error] crasher: initial call: mnesia_monitor:init/1, pid: <0.1969.0>, registered_name: mnesia_monitor, exit: {killed,[{gen_server,decode_msg,9,[{file,“gen_server.erl”},{line,909}]},{proc_lib,init_p_do_apply,3,[{file,“proc_lib.erl”},{line,240}]}]}, ancestors: [mnesia_kernel_sup,mnesia_sup,<0.1964.0>], message_queue_len: 1, messages: [{‘$gen_call’,{<0.1973.0>,#Ref<0.3521175414.1904738306.219462>},{close_log,latest_log}}], links: [<56870.1969.0>,<56871.1974.0>,<0.2009.0>], dictionary: , trap_exit: true, status: running, heap_size: 4185, stack_size: 28, reductions: 10473; neighbours:
{“Kernel pid terminated”,application_controller,“{application_start_failure,emqx_machine,{bad_return,{{emqx_machine_app,start,[normal,]},{‘EXIT’,{{badmatch,{error,{mria,{bad_return,{{mria_app,start,[normal,]},{‘EXIT’,{{badmatch,{error,{normal,{mnesia_app,start,[normal,]}}}},[{mria_mnesia,ensure_started,0,[{file,“mria_mnesia.erl”},{line,112}]},{mria_app,start,2,[{file,“mria_app.erl”},{line,36}]},{application_master,start_it_old,4,[{file,“application_master.erl”},{line,293}]}]}}}}}}},[{mria,start,0,[{file,“mria.erl”},{line,124}]},{ekka,start,0,[{file,“ekka.erl”},{line,94}]},{emqx_machine,start,0,[{file,“emqx_machine.erl”},{line,45}]},{emqx_machine_app,start,2,[{file,“emqx_machine_app.erl”},{line,27}]},{application_master,start_it_old,4,[{file,“application_master.erl”},{line,293}]}]}}}}}”}
Kernel pid terminated (application_controller) ({application_start_failure,emqx_machine,{bad_return,{{emqx_machine_app,start,[normal,]},{‘EXIT’,{{badmatch,{error,{mria,{bad_return,{{mria_app,start,[normal,]},{‘EXIT’,{{badmatch,{error,{normal,{mnesia_app,start,[normal,]}}}},[{mria_mnesia,ensure_started,0,[{file,“mria_mnesia.erl”},{line,112}]},{mria_app,start,2,[{file,“mria_app.erl”},{line,36}]},{application_master,start_it_old,4,[{file,“application_master.erl”},{line,293}]}]}}}}}}},[{mria,start,0,[{file,“mria.erl”},{line,124}]},{ekka,start,0,[{file,“ekka.erl”},{line,94}]},{emqx_machine,start,0,[{file,“emqx_machine.erl”},{line,45}]},{emqx_machine_app,start,2,[{file,“emqx_machine_app.erl”},{line,27}]},{application_master,start_it_old,4,[{file,“application_master.erl”},{line,293}]}]}}}}})

Crash dump is being written to: /opt/emqx/log/erl_crash.dump…done

cluster.autoclean defaults to 24h, which means the healthy core nodes automatically remove a stopped node's data from the cluster within one day.

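If a stopped node is restarted "a few days" later, it tries to join the original cluster with an old version of the Mnesia schema, while the other cluster nodes already have a new version of the schema. That is what produces the "Failed to merge schema" error. The error occurs when Mnesia tries to find a way to reconcile the schemas and fails; the node then stops itself to avoid damaging the cluster.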
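
As a rough illustration of this reasoning, the check below compares how long a node has been down with the autoclean window. The function names to_seconds and exceeds_autoclean are made up for this sketch, and it only understands a single number with an s/m/h/d suffix, which is a simplification rather than the full duration syntax EMQX accepts.

```shell
#!/bin/sh
# Convert a simple duration such as "24h", "30m", "3d" or "45s" to seconds.
# Assumption of this sketch: one integer followed by exactly one unit letter.
to_seconds() {
  ts_value="${1%?}"                      # strip the trailing unit character
  case "$ts_value" in
    ''|*[!0-9]*) echo "unsupported duration: $1" >&2; return 1 ;;
  esac
  case "$1" in
    *s) echo "$ts_value" ;;
    *m) echo $(( ts_value * 60 )) ;;
    *h) echo $(( ts_value * 3600 )) ;;
    *d) echo $(( ts_value * 86400 )) ;;
    *)  echo "unsupported duration: $1" >&2; return 1 ;;
  esac
}

# Exit status 0 when the downtime is longer than the autoclean window,
# 1 when it is not, 2 when either duration could not be parsed.
exceeds_autoclean() {
  ea_down=$(to_seconds "$1") || return 2
  ea_window=$(to_seconds "$2") || return 2
  [ "$ea_down" -gt "$ea_window" ]
}

# Example: a node that was down for three days with autoclean = 24h.
if exceeds_autoclean "3d" "24h"; then
  echo "downtime exceeds autoclean: the cluster has probably removed this node already"
fi
```

If the check succeeds, the node should be treated as having been removed from the cluster, so a plain restart is likely to hit the schema merge failure described above.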

The solution is:

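Delete the data directory of the problematic node (that is, the node that was stopped) and rejoin it to the cluster. It will then copy the current data and schema from the running cluster.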
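
For a manually clustered node, these steps could be sketched roughly as below. This is only an outline under assumptions: EMQX_HOME is guessed from the core dump path in the first log, the data directory is assumed to be $EMQX_HOME/data, and PEER_NODE is a placeholder for a healthy node's name, none of which is confirmed in this thread. The sketch moves the directory aside instead of deleting it so the old data can still be inspected, and it only prints the commands unless DRY_RUN=0 is set.

```shell
#!/bin/sh
# Sketch: reset a stale node and rejoin it to a running cluster (manual discovery).
# ASSUMPTIONS, adjust for your installation:
EMQX_HOME="${EMQX_HOME:-/usr/soft/mqtt/emqx}"           # guessed install root
DATA_DIR="${DATA_DIR:-$EMQX_HOME/data}"                 # assumed data directory
PEER_NODE="${PEER_NODE:-emqxcl@healthy-node.example}"   # hypothetical healthy node
DRY_RUN="${DRY_RUN:-1}"                                 # 1 = print commands only

# Print the command in dry-run mode, otherwise execute it.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

recover_node() {
  rn_backup="$DATA_DIR.bak.$(date +%Y%m%d%H%M%S)"
  # The node is usually not running at this point, so a failing stop is ignored.
  run "$EMQX_HOME/bin/emqx" stop || true
  # Move the stale data (including the old Mnesia schema) out of the way.
  run mv "$DATA_DIR" "$rn_backup"
  # Start with an empty data directory, then join the running cluster.
  run "$EMQX_HOME/bin/emqx" start
  run "$EMQX_HOME/bin/emqx" ctl cluster join "$PEER_NODE"
}

recover_node
```

Run it once in the default dry-run mode to review the printed commands, then again with DRY_RUN=0 on the failed node. Afterwards ./bin/emqx ctl cluster status on any node should list the rejoined node.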
