EMQX 5.0.8 service crash on Windows

Bug report

Environment

  • EMQX version: 5.0.8, single node
  • Operating system version: Windows Server 2008 R2

Steps to reproduce

The server runs normally and the service crashes occasionally.
Port 1883 was changed to 21883; no other configuration was touched.

Actual behavior

2022-12-30T19:23:12.223000+08:00 [error] crasher: initial call: memsup:init/1, pid: <0.29073.0>, registered_name: memsup, exit: {{timeout,{gen_server,call,[os_mon_sysinfo,get_mem_info]}},[{gen_server,handle_common_reply,8,[{file,“gen_server.erl”},{line,811}]},{proc_lib,init_p_do_apply,3,[{file,“proc_lib.erl”},{line,226}]}]}, ancestors: [os_mon_sup,<0.1728.0>], message_queue_len: 0, messages: [], links: [<0.1729.0>], dictionary: [], trap_exit: true, status: running, heap_size: 6772, stack_size: 29, reductions: 8183; neighbours:
2022-12-30T19:23:12.223000+08:00 [error] Supervisor: {local,os_mon_sup}. Context: child_terminated. Reason: {timeout,{gen_server,call,[os_mon_sysinfo,get_mem_info]}}. Offender: id=memsup,pid=<0.29073.0>.
2022-12-30T19:23:17.219000+08:00 [error] Generic server memsup terminating. Reason: {timeout,{gen_server,call,[os_mon_sysinfo,get_mem_info]}}. Last message: {‘EXIT’,<0.29079.0>,{timeout,{gen_server,call,[os_mon_sysinfo,get_mem_info]}}}. State: {state,{win32,nt},false,undefined,undefined,false,60000,30000,0.8,0.05,<0.29079.0>,#Ref<0.2776357581.4128768001.257717>,undefined,[{ext,{<0.1851.0>,#Ref<0.2776357581.4128768001.257774>}},reg],[]}.
2022-12-30T19:23:17.219000+08:00 [error] crasher: initial call: memsup:init/1, pid: <0.29078.0>, registered_name: memsup, exit: {{timeout,{gen_server,call,[os_mon_sysinfo,get_mem_info]}},[{gen_server,handle_common_reply,8,[{file,“gen_server.erl”},{line,811}]},{proc_lib,init_p_do_apply,3,[{file,“proc_lib.erl”},{line,226}]}]}, ancestors: [os_mon_sup,<0.1728.0>], message_queue_len: 0, messages: [], links: [<0.1729.0>], dictionary: [], trap_exit: true, status: running, heap_size: 6772, stack_size: 29, reductions: 8183; neighbours:
2022-12-30T19:23:17.220000+08:00 [error] Supervisor: {local,os_mon_sup}. Context: child_terminated. Reason: {timeout,{gen_server,call,[os_mon_sysinfo,get_mem_info]}}. Offender: id=memsup,pid=<0.29078.0>.
2022-12-30T19:23:22.221000+08:00 [error] Generic server memsup terminating. Reason: {timeout,{gen_server,call,[os_mon_sysinfo,get_mem_info]}}. Last message: {‘EXIT’,<0.29082.0>,{timeout,{gen_server,call,[os_mon_sysinfo,get_mem_info]}}}. State: {state,{win32,nt},false,undefined,undefined,false,60000,30000,0.8,0.05,<0.29082.0>,#Ref<0.2776357581.4128768001.257988>,undefined,[{ext,{<0.1851.0>,#Ref<0.2776357581.4128768001.258044>}},reg],[]}.
2022-12-30T19:23:22.221000+08:00 [error] crasher: initial call: memsup:init/1, pid: <0.29081.0>, registered_name: memsup, exit: {{timeout,{gen_server,call,[os_mon_sysinfo,get_mem_info]}},[{gen_server,handle_common_reply,8,[{file,“gen_server.erl”},{line,811}]},{proc_lib,init_p_do_apply,3,[{file,“proc_lib.erl”},{line,226}]}]}, ancestors: [os_mon_sup,<0.1728.0>], message_queue_len: 0, messages: [], links: [<0.1729.0>], dictionary: [], trap_exit: true, status: running, heap_size: 6772, stack_size: 29, reductions: 8183; neighbours:
2022-12-30T19:23:22.228000+08:00 [error] Supervisor: {local,os_mon_sup}. Context: child_terminated. Reason: {timeout,{gen_server,call,[os_mon_sysinfo,get_mem_info]}}. Offender: id=memsup,pid=<0.29081.0>.
2022-12-30T19:23:27.229000+08:00 [error] Generic server memsup terminating. Reason: {timeout,{gen_server,call,[os_mon_sysinfo,get_mem_info]}}. Last message: {‘EXIT’,<0.29086.0>,{timeout,{gen_server,call,[os_mon_sysinfo,get_mem_info]}}}. State: {state,{win32,nt},false,undefined,undefined,false,60000,30000,0.8,0.05,<0.29086.0>,#Ref<0.2776357581.4128768001.258247>,undefined,[{ext,{<0.1851.0>,#Ref<0.2776357581.4128768001.258298>}},reg],[]}.
2022-12-30T19:23:27.229000+08:00 [error] crasher: initial call: memsup:init/1, pid: <0.29085.0>, registered_name: memsup, exit: {{timeout,{gen_server,call,[os_mon_sysinfo,get_mem_info]}},[{gen_server,handle_common_reply,8,[{file,“gen_server.erl”},{line,811}]},{proc_lib,init_p_do_apply,3,[{file,“proc_lib.erl”},{line,226}]}]}, ancestors: [os_mon_sup,<0.1728.0>], message_queue_len: 0, messages: [], links: [<0.1729.0>], dictionary: [], trap_exit: true, status: running, heap_size: 6772, stack_size: 29, reductions: 8183; neighbours:
2022-12-30T19:23:27.229000+08:00 [error] Supervisor: {local,os_mon_sup}. Context: child_terminated. Reason: {timeout,{gen_server,call,[os_mon_sysinfo,get_mem_info]}}. Offender: id=memsup,pid=<0.29085.0>.
2022-12-30T19:23:32.232000+08:00 [error] Generic server memsup terminating. Reason: {timeout,{gen_server,call,[os_mon_sysinfo,get_mem_info]}}. Last message: {‘EXIT’,<0.29089.0>,{timeout,{gen_server,call,[os_mon_sysinfo,get_mem_info]}}}. State: {state,{win32,nt},false,undefined,undefined,false,60000,30000,0.8,0.05,<0.29089.0>,#Ref<0.2776357581.4128768001.258505>,undefined,[{ext,{<0.1851.0>,#Ref<0.2776357581.4128768001.258561>}},reg],[]}.
2022-12-30T19:23:32.232000+08:00 [error] crasher: initial call: memsup:init/1, pid: <0.29088.0>, registered_name: memsup, exit: {{timeout,{gen_server,call,[os_mon_sysinfo,get_mem_info]}},[{gen_server,handle_common_reply,8,[{file,“gen_server.erl”},{line,811}]},{proc_lib,init_p_do_apply,3,[{file,“proc_lib.erl”},{line,226}]}]}, ancestors: [os_mon_sup,<0.1728.0>], message_queue_len: 0, messages: [], links: [<0.1729.0>], dictionary: [], trap_exit: true, status: running, heap_size: 6772, stack_size: 29, reductions: 8183; neighbours:
2022-12-30T19:23:32.233000+08:00 [error] Supervisor: {local,os_mon_sup}. Context: child_terminated. Reason: {timeout,{gen_server,call,[os_mon_sysinfo,get_mem_info]}}. Offender: id=memsup,pid=<0.29088.0>.
2022-12-30T19:23:32.233000+08:00 [error] Supervisor: {local,os_mon_sup}. Context: shutdown. Reason: reached_max_restart_intensity. Offender: id=memsup,pid=<0.29088.0>.

The resulting dump is attached below; the archive contains the log files and the dump file.
log.zip (662.3 KB)

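From the log, my rough understanding is that memsup timed out while calling get_mem_info. What I do not understand is why this happens, and how to avoid or mitigate it so that the service does not become unavailable.
It does not happen often on our servers; so far it has occurred once each on the development server and the production server.
Thanks.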
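To make the pattern in the log easier to see, here is a minimal Python sketch that pulls out the timestamps of the memsup `child_terminated` supervisor reports and computes the gap between consecutive crashes. The regular expression and the function names `memsup_crash_times` and `crash_intervals` are my own illustrative choices, not part of EMQX or OTP; the sample lines are copied from the excerpt above.

```python
import re
from datetime import datetime

# Matches the os_mon_sup "child_terminated" reports for memsup, in the
# format seen in the log excerpt. The pattern is an assumption based on
# that excerpt only; other EMQX log formatters may differ.
PATTERN = re.compile(
    r"^(?P<ts>\S+) \[error\] Supervisor: \{local,os_mon_sup\}\. "
    r"Context: child_terminated\..*Offender: id=memsup,"
)

def memsup_crash_times(lines):
    """Return the timestamps of memsup child_terminated reports."""
    times = []
    for line in lines:
        match = PATTERN.match(line)
        if match:
            times.append(datetime.fromisoformat(match.group("ts")))
    return times

def crash_intervals(times):
    """Return the seconds elapsed between consecutive crashes."""
    return [(b - a).total_seconds() for a, b in zip(times, times[1:])]

# Three supervisor reports copied from the log above.
SAMPLE = [
    "2022-12-30T19:23:12.223000+08:00 [error] Supervisor: {local,os_mon_sup}. Context: child_terminated. Reason: {timeout,{gen_server,call,[os_mon_sysinfo,get_mem_info]}}. Offender: id=memsup,pid=<0.29073.0>.",
    "2022-12-30T19:23:17.220000+08:00 [error] Supervisor: {local,os_mon_sup}. Context: child_terminated. Reason: {timeout,{gen_server,call,[os_mon_sysinfo,get_mem_info]}}. Offender: id=memsup,pid=<0.29078.0>.",
    "2022-12-30T19:23:22.228000+08:00 [error] Supervisor: {local,os_mon_sup}. Context: child_terminated. Reason: {timeout,{gen_server,call,[os_mon_sysinfo,get_mem_info]}}. Offender: id=memsup,pid=<0.29081.0>.",
]

if __name__ == "__main__":
    for gap in crash_intervals(memsup_crash_times(SAMPLE)):
        print(f"{gap:.3f} s since previous memsup crash")
```

The crashes in the excerpt are spaced roughly 5 seconds apart. That spacing would be consistent with the default 5000 ms timeout of `gen_server:call/2`, though nothing in this thread confirms that this is the timeout involved.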

Check whether your CPU usage is already fairly high. Also, it is still best to deploy EMQX on Linux.

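CPU is fine, at around 30%, and memory is at around 40%. The customer and our company require Windows Server 2008 R2. I asked about this specifically before, and switching to Linux is not an option, so there is nothing I can do about it.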

Understood. Does this timeout appear every time EMQX starts, or only occasionally?

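Only occasionally. It has been in use for almost 2 months and this has happened only twice so far. I did not pay attention the first time, but now it has happened again, so I want to find the cause.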


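I just checked the service again. It was started yesterday, and this morning the timeout occurred 2 more times, but the service did not exit on its own, possibly because the number of failures did not reach the limit.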
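The guess about "not reaching the limit" matches how OTP supervisors behave in general: a supervisor gives up (the `reached_max_restart_intensity` reason in the log) only when more than a configured number of restarts happen within a configured time window. Below is a rough Python model of that sliding-window rule. The function name and the limits used in the example (4 restarts in 3600 seconds) are hypothetical placeholders; the actual values used by `os_mon_sup` are not stated in this thread.

```python
from collections import deque

def exceeds_restart_intensity(restart_times, max_restarts, max_seconds):
    """Return the time at which the supervisor would give up, or None.

    Simplified model of the OTP rule: the supervisor shuts down when more
    than max_restarts restarts occur within a window of max_seconds.
    restart_times must be sorted and expressed in seconds.
    """
    window = deque()
    for t in restart_times:
        window.append(t)
        # Forget restarts that fall outside the window ending at t.
        while window and t - window[0] > max_seconds:
            window.popleft()
        if len(window) > max_restarts:
            return t
    return None

if __name__ == "__main__":
    # Hypothetical limits, for illustration only.
    burst = [0, 5, 10, 15, 20]   # five crashes 5 s apart, as in the log
    isolated = [0, 5]            # two crashes, as seen "this morning"
    print(exceeds_restart_intensity(burst, max_restarts=4, max_seconds=3600))
    print(exceeds_restart_intensity(isolated, max_restarts=4, max_seconds=3600))
```

Under these assumed limits, a burst of five crashes in a row crosses the threshold, while two isolated timeouts do not, which would explain why the service stayed up this time.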

I will pass this on to our R&D colleagues to take a look. In the meantime, please keep monitoring CPU usage.

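Hello, we recommend closely monitoring the server's resource usage, memory in particular. Persistent timeouts like these are fundamentally a resource problem.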

The server has 32 GB of memory and usage generally does not exceed 20 GB, so that should not be much of a problem, right?