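- EMQX version: 5.7.2
- Operating system: Huawei Cloud EulerOS 2.0 (x86_64)
- Deployment: Kubernetes cluster with one core replica and two replicant (repl) replicas
Starting yesterday morning, the repl nodes began restarting frequently. The following log entries keep appearing: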
2026-06-22T13:58:37.108733491+08:00 2026-06-22T05:58:37.108443+00:00 [error] event=connect_to_remote_server, peer=emqx-ev-xxxx@10.32.106.203, port=5369, reason=timeout
2026-06-22T13:58:37.108934617+08:00 2026-06-22T05:58:37.108664+00:00 [error] crasher: initial call: gen_rpc_client:init/1, pid: <0.346398.0>, registered_name: [], exit: {{badrpc,timeout},[{gen_server,init_it,6,[{file,"gen_server.erl"},{line,961}]},{proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,241}]}]}, ancestors: [gen_rpc_client_sup,gen_rpc_sup,<0.2290.0>], message_queue_len: 0, messages: [], links: [<0.2296.0>], dictionary: [], trap_exit: true, status: running, heap_size: 1598, stack_size: 28, reductions: 3832; neighbours:
2026-06-22T15:17:05.208899397+08:00 2026-06-22T07:17:05.208586+00:00 [error] State machine {acceptor,{{10,32,103,227},51026}} terminating. Reason: {badtcp,closed}. Stack: [{gen_statem,loop_state_callback_result,11,[{file,"gen_statem.erl"},{line,1524}]},{proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,241}]}]. Last event: {{call,{<0.2293.0>,#Ref<0.1191865503.3325296642.42693>}},{socket_ready,#Port<0.344147>}}. State: {waiting_for_socket,{state,#Port<0.344147>,tcp,gen_rpc_driver_tcp,tcp_closed,tcp_error,{{10,32,103,227},51026},disabled,disabled}}. Client gen_rpc_server_tcp stacktrace: [{prim_inet,accept0,3,[]},{inet_tcp,accept,2,[{file,"inet_tcp.erl"},{line,227}]},{gen_rpc_server,waiting_for_connection,3,[{file,"gen_rpc_server.erl"},{line,70}]},{gen_statem,loop_state_callback,11,[{file,"gen_statem.erl"},{line,1395}]}].
2026-06-22T15:17:05.209027991+08:00 2026-06-22T07:17:05.208853+00:00 [error] crasher: initial call: gen_rpc_acceptor:init/1, pid: <0.352263.0>, registered_name: [], exit: {{badtcp,closed},[{gen_statem,loop_state_callback_result,11,[{file,"gen_statem.erl"},{line,1524}]},{proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,241}]}]}, ancestors: [gen_rpc_acceptor_sup,gen_rpc_sup,<0.2290.0>], message_queue_len: 0, messages: [], links: [<0.2294.0>], dictionary: [], trap_exit: true, status: running, heap_size: 6772, stack_size: 28, reductions: 10972; neighbours:
2026-06-22T15:17:05.209167984+08:00 2026-06-22T07:17:05.209009+00:00 [error] Supervisor: {local,gen_rpc_acceptor_sup}. Context: child_terminated. Reason: {badtcp,closed}. Offender: id=gen_rpc_acceptor,pid=<0.352263.0>.
2026-06-22T15:17:53.451764538+08:00 2026-06-22T07:17:53.451459+00:00 [error] msg: gen_rpc_client_auth_timeout, error: closed, peer: {{10,32,103,227},59782}
The following logs were emitted just before the container exited:
2026-06-23T13:40:37.63292387+08:00 stdout F 2026-06-23T05:40:37.632778+00:00 [error] Supervisor: {local,emqx_sys_sup}. Context: child_terminated. Reason: {timeout,{gen_server,call,[application_controller,which_applications]}}. Offender: id=emqx_os_mon,pid=<0.3873.0>.
2026-06-23T13:40:47.176021546+08:00 stdout F 2026-06-23T05:40:47.174471+00:00 [error] supervisor: {esockd_connection_sup,[<0.345896.0>,<0.347688.0>,<0.911026.0>,<0.911048.0>,<0.389973.0>,<0.271255.0>]}, errorContext: connection_shutdown_error, reason: {shutdown,keepalive_timeout}, offender: [{pid,[<0.345896.0>,<0.347688.0>,<0.911026.0>,<0.911048.0>,<0.389973.0>,<0.271255.0>]},{name,connection},{mfargs,{emqx_connection,start_link,[#{listener => {tcp,default},limiter => #{connection => #{initial => 0,rate => infinity,burst => 0}},zone => default,enable_authn => true}]}}]
2026-06-23T13:40:47.176043626+08:00 stdout F 2026-06-23T05:40:47.174655+00:00 [error] supervisor: {esockd_connection_sup,[<0.914267.0>,<0.914354.0>]}, errorContext: connection_shutdown_error, reason: {shutdown,tcp_closed}, offender: [{pid,[<0.914267.0>,<0.914354.0>]},{name,connection},{mfargs,{emqx_connection,start_link,[#{listener => {tcp,default},limiter => #{connection => #{initial => 0,rate => infinity,burst => 0}},zone => default,enable_authn => true}]}}]
2026-06-23T13:40:47.176046056+08:00 stdout F Listener tcp:default on :30011 stopped.
2026-06-23T13:40:47.177194827+08:00 stdout F Listener tcp:server on :30012 stopped.
The core node also frequently logs the following:
2026-06-23T10:35:46.410097806+08:00 2026-06-23T02:35:46.405759+00:00 [error] lock_owner_status:, [{status,waiting},{message_queue_len,2},{current_stacktrace,[{optvar,read,2,[{file,"optvar.erl"},{line,135}]},{mria,find_upstream_node,1,[{file,"mria.erl"},{line,587}]},{mria,rpc_to_core_node,5,[{file,"mria.erl"},{line,552}]},{emqx_cm,register_channel,3,[{file,"emqx_cm.erl"},{line,190}]},{emqx_cm,'-open_session/4-fun-1-',6,[{file,"emqx_cm.erl"},{line,302}]},{emqx_cm_locker,trans,2,[{file,"emqx_cm_locker.erl"},{line,44}]},{emqx_channel,process_connect,2,[{file,"emqx_channel.erl"},{line,587}]},{emqx_connection,with_channel,3,[{file,"emqx_connection.erl"},{line,811}]},{emqx_connection,process_msg,2,[{file,"emqx_connection.erl"},{line,472}]},{emqx_connection,process_msg,2,[{file,"emqx_connection.erl"},{line,478}]},{emqx_connection,handle_recv,3,[{file,"emqx_connection.erl"},{line,434}]},{proc_lib,wake_up,3,[{file,"proc_lib.erl"},{line,251}]}]}]
2026-06-23T10:35:46.408227997+08:00 2026-06-23T02:35:46.401554+00:00 [error] kill <66063.397944.0> as it has held the lock for too long, resource: <<"xxxxxxx">>
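According to monitoring, up until two days ago CPU usage (emqx_vm_cpu_use) stayed at 60-70% and memory usage at 50-60%. Yesterday morning, CPU usage on the two repl replicas alternately climbed above 80%, and the replicas then began going down in turn.
As of this writing, I have added one more replica and the errors above no longer appear in the logs. For now it looks like high CPU usage is what caused the pods to go down.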
By the way, the EMQX image ships with too few tools: it lacks utilities such as telnet and curl, and they cannot be installed via sudo apt either.
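For anyone hitting the same lack of tools, a possible stopgap for the missing telnet is bash's built-in `/dev/tcp` redirection. This is only a sketch: it assumes the container has `bash` (built with `/dev/tcp` support) and the coreutils `timeout` command, which I have not verified for every EMQX image. The function name `check_port` is made up for illustration.

```shell
#!/usr/bin/env bash
# check_port HOST PORT [TIMEOUT_SECONDS]
# Hypothetical helper: tries to open a TCP connection using bash's /dev/tcp
# pseudo-device, so neither telnet nor curl is needed.
check_port() {
  local host="$1" port="$2" secs="${3:-3}"
  # Run the connect attempt in a child bash so `timeout` can bound it.
  # Inside the child, $0 is the host and $1 is the port.
  if timeout "$secs" bash -c 'exec 3<>"/dev/tcp/$0/$1"' "$host" "$port" 2>/dev/null; then
    echo "open: ${host}:${port}"
  else
    echo "unreachable: ${host}:${port}"
    return 1
  fi
}
```

For example, `check_port 10.32.106.203 5369` would test the gen_rpc peer and port that appear in the `connect_to_remote_server` timeout above. It only tells you whether a TCP connection can be opened within the timeout, not whether gen_rpc itself is healthy.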