EMQX stops suddenly after running for a few days. It cannot be restarted, and only rebooting the server restores it.

Environment

  • EMQX version: 4.3.22
  • OS version: Linux x86_64 (amd64), CentOS 7

Problem

  1. EMQX stopped suddenly after running for a few days.

Disk usage:
[root@VM-16-11-centos ~]# df
Filesystem 1K-blocks Used Available Use% Mounted on
/dev/vda1 51473868 13820484 35456140 29% /

emqx.log

2023-05-12T14:15:06.630661+08:00 [error] Generic server disksup terminating. Reason: {badarg,[{erlang,port_close,[#Port<0.8>],[]},{disksup,terminate,2,[{file,"disksup.erl"},{line,169}]},{gen_server,try_terminate,3,[{file,"gen_server.erl"},{line,727}]},{gen_server,terminate,10,[{file,"gen_server.erl"},{line,912}]},{proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,226}]}]}. Last message: timeout. State: [{data,[{"OS",{unix,linux}},{"Timeout",1800000},{"Threshold",80},{"DiskData",[{"/dev",7942056,0},{"/dev/shm",7953060,1},{"/run",7953060,1},{"/sys/fs/cgroup",7953060,0},{"/",51473868,29},{"/run/user/0",1590612,0},{"/run/user/1002",1590612,0}]}]}].
2023-05-12T14:15:06.630972+08:00 [error] crasher: initial call: disksup:init/1, pid: <0.1547.0>, registered_name: disksup, error: {badarg,[{erlang,port_close,[#Port<0.8>],[]},{disksup,terminate,2,[{file,"disksup.erl"},{line,169}]},{gen_server,try_terminate,3,[{file,"gen_server.erl"},{line,727}]},{gen_server,terminate,10,[{file,"gen_server.erl"},{line,912}]},{proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,226}]}]}, ancestors: [os_mon_sup,<0.1545.0>], message_queue_len: 0, messages: [], links: [<0.1546.0>], dictionary: [], trap_exit: true, status: running, heap_size: 6772, stack_size: 28, reductions: 662862; neighbours:
2023-05-12T14:15:06.631209+08:00 [error] Supervisor: {local,os_mon_sup}. Context: child_terminated. Reason: {badarg,[{erlang,port_close,[#Port<0.8>],[]},{disksup,terminate,2,[{file,"disksup.erl"},{line,169}]},{gen_server,try_terminate,3,[{file,"gen_server.erl"},{line,727}]},{gen_server,terminate,10,[{file,"gen_server.erl"},{line,912}]},{proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,226}]}]}. Offender: id=disksup,pid=<0.1547.0>.
2023-05-12T14:15:06.633515+08:00 [error] Generic server disksup terminating. Reason: {badarg,[{erlang,port_close,[#Port<0.17981>],[]},{disksup,terminate,2,[{file,"disksup.erl"},{line,169}]},{gen_server,try_terminate,3,[{file,"gen_server.erl"},{line,727}]},{gen_server,terminate,10,[{file,"gen_server.erl"},{line,912}]},{proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,226}]}]}. Last message: timeout. State: [{data,[{"OS",{unix,linux}},{"Timeout",1800000},{"Threshold",80},{"DiskData",[]}]}].
2023-05-12T14:15:06.633668+08:00 [error] crasher: initial call: disksup:init/1, pid: <0.1230.5>, registered_name: disksup, error: {badarg,[{erlang,port_close,[#Port<0.17981>],[]},{disksup,terminate,2,[{file,"disksup.erl"},{line,169}]},{gen_server,try_terminate,3,[{file,"gen_server.erl"},{line,727}]},{gen_server,terminate,10,[{file,"gen_server.erl"},{line,912}]},{proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,226}]}]}, ancestors: [os_mon_sup,<0.1545.0>], message_queue_len: 0, messages: [], links: [<0.1546.0>], dictionary: [], trap_exit: true, status: running, heap_size: 6772, stack_size: 28, reductions: 9046; neighbours:
2023-05-12T14:15:06.633918+08:00 [error] Supervisor: {local,os_mon_sup}. Context: child_terminated. Reason: {badarg,[{erlang,port_close,[#Port<0.17981>],[]},{disksup,terminate,2,[{file,"disksup.erl"},{line,169}]},{gen_server,try_terminate,3,[{file,"gen_server.erl"},{line,727}]},{gen_server,terminate,10,[{file,"gen_server.erl"},{line,912}]},{proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,226}]}]}. Offender: id=disksup,pid=<0.1230.5>.
2023-05-12T14:15:06.635967+08:00 [error] Generic server disksup terminating. Reason: {badarg,[{erlang,port_close,[#Port<0.17982>],[]},{disksup,terminate,2,[{file,"disksup.erl"},{line,169}]},{gen_server,try_terminate,3,[{file,"gen_server.erl"},{line,727}]},{gen_server,terminate,10,[{file,"gen_server.erl"},{line,912}]},{proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,226}]}]}. Last message: timeout. State: [{data,[{"OS",{unix,linux}},{"Timeout",1800000},{"Threshold",80},{"DiskData",[]}]}].
2023-05-12T14:15:06.636120+08:00 [error] crasher: initial call: disksup:init/1, pid: <0.1231.5>, registered_name: disksup, error: {badarg,[{erlang,port_close,[#Port<0.17982>],[]},{disksup,terminate,2,[{file,"disksup.erl"},{line,169}]},{gen_server,try_terminate,3,[{file,"gen_server.erl"},{line,727}]},{gen_server,terminate,10,[{file,"gen_server.erl"},{line,912}]},{proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,226}]}]}, ancestors: [os_mon_sup,<0.1545.0>], message_queue_len: 0, messages: [], links: [<0.1546.0>], dictionary: [], trap_exit: true, status: running, heap_size: 6772, stack_size: 28, reductions: 9046; neighbours:
2023-05-12T14:15:06.636401+08:00 [error] Supervisor: {local,os_mon_sup}. Context: child_terminated. Reason: {badarg,[{erlang,port_close,[#Port<0.17982>],[]},{disksup,terminate,2,[{file,"disksup.erl"},{line,169}]},{gen_server,try_terminate,3,[{file,"gen_server.erl"},{line,727}]},{gen_server,terminate,10,[{file,"gen_server.erl"},{line,912}]},{proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,226}]}]}. Offender: id=disksup,pid=<0.1231.5>.
2023-05-12T14:15:06.638689+08:00 [error] Generic server disksup terminating. Reason: {badarg,[{erlang,port_close,[#Port<0.17983>],[]},{disksup,terminate,2,[{file,"disksup.erl"},{line,169}]},{gen_server,try_terminate,3,[{file,"gen_server.erl"},{line,727}]},{gen_server,terminate,10,[{file,"gen_server.erl"},{line,912}]},{proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,226}]}]}. Last message: timeout. State: [{data,[{"OS",{unix,linux}},{"Timeout",1800000},{"Threshold",80},{"DiskData",[]}]}].
2023-05-12T14:15:06.638901+08:00 [error] crasher: initial call: disksup:init/1, pid: <0.1232.5>, registered_name: disksup, error: {badarg,[{erlang,port_close,[#Port<0.17983>],[]},{disksup,terminate,2,[{file,"disksup.erl"},{line,169}]},{gen_server,try_terminate,3,[{file,"gen_server.erl"},{line,727}]},{gen_server,terminate,10,[{file,"gen_server.erl"},{line,912}]},{proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,226}]}]}, ancestors: [os_mon_sup,<0.1545.0>], message_queue_len: 0, messages: [], links: [<0.1546.0>], dictionary: [], trap_exit: true, status: running, heap_size: 6772, stack_size: 28, reductions: 9046; neighbours:
2023-05-12T14:15:06.639177+08:00 [error] Supervisor: {local,os_mon_sup}. Context: child_terminated. Reason: {badarg,[{erlang,port_close,[#Port<0.17983>],[]},{disksup,terminate,2,[{file,"disksup.erl"},{line,169}]},{gen_server,try_terminate,3,[{file,"gen_server.erl"},{line,727}]},{gen_server,terminate,10,[{file,"gen_server.erl"},{line,912}]},{proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,226}]}]}. Offender: id=disksup,pid=<0.1232.5>.
2023-05-12T14:15:06.646668+08:00 [error] Generic server disksup terminating. Reason: {badarg,[{erlang,port_close,[#Port<0.17984>],[]},{disksup,terminate,2,[{file,"disksup.erl"},{line,169}]},{gen_server,try_terminate,3,[{file,"gen_server.erl"},{line,727}]},{gen_server,terminate,10,[{file,"gen_server.erl"},{line,912}]},{proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,226}]}]}. Last message: timeout. State: [{data,[{"OS",{unix,linux}},{"Timeout",1800000},{"Threshold",80},{"DiskData",[]}]}].
2023-05-12T14:15:06.646920+08:00 [error] crasher: initial call: disksup:init/1, pid: <0.1233.5>, registered_name: disksup, error: {badarg,[{erlang,port_close,[#Port<0.17984>],[]},{disksup,terminate,2,[{file,"disksup.erl"},{line,169}]},{gen_server,try_terminate,3,[{file,"gen_server.erl"},{line,727}]},{gen_server,terminate,10,[{file,"gen_server.erl"},{line,912}]},{proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,226}]}]}, ancestors: [os_mon_sup,<0.1545.0>], message_queue_len: 0, messages: [], links: [<0.1546.0>], dictionary: [], trap_exit: true, status: running, heap_size: 6772, stack_size: 28, reductions: 9046; neighbours:
2023-05-12T14:15:06.647212+08:00 [error] Supervisor: {local,os_mon_sup}. Context: child_terminated. Reason: {badarg,[{erlang,port_close,[#Port<0.17984>],[]},{disksup,terminate,2,[{file,"disksup.erl"},{line,169}]},{gen_server,try_terminate,3,[{file,"gen_server.erl"},{line,727}]},{gen_server,terminate,10,[{file,"gen_server.erl"},{line,912}]},{proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,226}]}]}. Offender: id=disksup,pid=<0.1233.5>.
2023-05-12T14:15:06.647371+08:00 [error] Supervisor: {local,os_mon_sup}. Context: shutdown. Reason: reached_max_restart_intensity. Offender: id=disksup,pid=<0.1233.5>.
2023-05-12T14:15:06.678762+08:00 [error] [Pool] Error: badarg, [{ets,delete_object,[emqx_channel,{<<"34851805F034">>,<0.697.5>}],[]},{emqx_cm,do_unregister_channel,1,[{file,"emqx_cm.erl"},{line,149}]},{lists,foreach,2,[{file,"lists.erl"},{line,1342}]},{emqx_pool,handle_cast,2,[{file,"emqx_pool.erl"},{line,108}]},{gen_server,try_dispatch,4,[{file,"gen_server.erl"},{line,689}]},{gen_server,handle_msg,6,[{file,"gen_server.erl"},{line,765}]},{proc_lib,wake_up,3,[{file,"proc_lib.erl"},{line,236}]}]

erlang.log

===== Fri May 12 14:15:06 CST 2023
[os_mon] memory supervisor port (memsup): Erlang has closed

[os_mon] cpu supervisor port (cpu_sup): Erlang has closed

Stop http:management listener on 0.0.0.0:8081 successfully.
(emqx@127.0.0.1)1> {"Kernel pid terminated",application_controller,"{application_terminated,os_mon,shutdown}"}

Kernel pid terminated (application_controller) ({application_terminated,os_mon,shutdown})

Crash dump is being written to: /var/log/emqx/crash.dump…done

=====
===== LOGGING STARTED Fri May 12 14:18:57 CST 2023

Failed to create dirty io scheduler thread 9, error = 11

/usr/bin/emqx: line 238: 3697 Aborted env ERL_CRASH_DUMP_BYTES=0 "$BINDIR/$PROGNAME" -boot "$REL_DIR/start_clean" -eval "crypto:start(),halt()"
FATAL: Unable to start Erlang.
Please make sure openssl-1.1.1 (libcrypto) and libncurses are installed.
Also ensure it's running on the correct platform,
this EMQX release is built for 23.3.4.9-3-x86_64-unknown-linux-gnu-64-centos7

crash.dump

crash.zip (411.1 KB)

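The real cause is probably the thread-creation failure reported in erlang.log ("Failed to create dirty io scheduler thread 9, error = 11"). My guess is that the cloud provider limits the maximum number of processes and threads, or that your ulimit settings are not high enough.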
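
The `error = 11` in the erlang.log line "Failed to create dirty io scheduler thread 9, error = 11" is most likely an errno value, and errno 11 is EAGAIN on Linux, which thread creation returns when a process/thread limit or resource cap is hit. Below is a minimal sketch, assuming a Linux host with the usual /proc files, that decodes the errno and prints the per-user and system-wide ceilings side by side. The helper names (`describe_errno`, `read_int`, `thread_ceilings`) are made up for illustration and are not part of EMQX or Erlang/OTP.

```python
import errno
import os
import resource
from pathlib import Path


def describe_errno(code: int) -> str:
    """Return the symbolic name and message for an errno value on this platform."""
    name = errno.errorcode.get(code, "UNKNOWN")
    return f"{name}: {os.strerror(code)}"


def read_int(path: str):
    """Read a single integer from a file such as a /proc entry; None if unavailable."""
    try:
        return int(Path(path).read_text().strip())
    except (OSError, ValueError):
        return None


def thread_ceilings() -> dict:
    """Collect limits that can cause thread creation to fail with EAGAIN."""
    # RLIMIT_NPROC is the value `ulimit -u` reports; -1 means unlimited.
    soft, hard = resource.getrlimit(resource.RLIMIT_NPROC)
    return {
        "rlimit_nproc_soft": soft,
        "rlimit_nproc_hard": hard,
        # System-wide ceilings (Linux only; None elsewhere).
        "kernel_threads_max": read_int("/proc/sys/kernel/threads-max"),
        "kernel_pid_max": read_int("/proc/sys/kernel/pid_max"),
    }


if __name__ == "__main__":
    print(describe_errno(11))
    for key, value in thread_ceilings().items():
        print(f"{key}: {value}")
```

Run it as the same user that starts EMQX, because RLIMIT_NPROC is tracked per user rather than per machine.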

[root@VM-16-11-centos ~]# ulimit -n
100001
[root@VM-16-11-centos ~]# ulimit -u
62047


Only a small number of devices are online on this server, and the ulimit values are already set fairly high.
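
One caveat: `ulimit` run in a login shell reports that shell's limits, which are not necessarily the ones applied to the running EMQX process (a service manager such as systemd can set its own). The following is a minimal sketch, assuming a Linux /proc filesystem, that reads the limits actually in effect for a given PID and counts threads. The function names are hypothetical, and the per-UID count is only an approximation of what the kernel checks against the "Max processes" limit.

```python
import os
import re
import sys
from pathlib import Path


def parse_limits(text: str) -> dict:
    """Parse the table from /proc/<pid>/limits into {name: (soft, hard)}."""
    limits = {}
    for line in text.splitlines()[1:]:  # skip the header row
        # Columns are separated by runs of two or more spaces;
        # limit names themselves contain only single spaces.
        fields = re.split(r"\s{2,}", line.strip())
        if len(fields) >= 3:
            limits[fields[0]] = (fields[1], fields[2])
    return limits


def thread_count(pid: int) -> int:
    """Count the threads of one process via its /proc/<pid>/task directory."""
    return sum(1 for _ in Path(f"/proc/{pid}/task").iterdir())


def threads_for_uid(uid: int) -> int:
    """Approximate the total threads of all processes owned by uid."""
    total = 0
    for entry in Path("/proc").iterdir():
        if not entry.name.isdigit():
            continue
        try:
            if entry.stat().st_uid == uid:
                total += thread_count(int(entry.name))
        except OSError:
            continue  # the process exited while we were scanning
    return total


if __name__ == "__main__":
    # Pass the PID of the EMQX Erlang VM; defaults to this script's own PID.
    pid = int(sys.argv[1]) if len(sys.argv) > 1 else os.getpid()
    limits = parse_limits(Path(f"/proc/{pid}/limits").read_text())
    print("Max processes:", limits.get("Max processes"))
    print("Max open files:", limits.get("Max open files"))
    print("threads in process:", thread_count(pid))
    print("threads for owner:", threads_for_uid(Path(f"/proc/{pid}").stat().st_uid))
```

If the "Max processes" value printed for the EMQX PID is much lower than the 62047 shown by `ulimit -u`, or the owner's thread count is close to it, that would support the thread-limit explanation.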