EMQX pods restart frequently

  • EMQX version: 5.7.2
  • Operating system: Huawei Cloud EulerOS 2.0 (x86_64)
  • Deployment: k8s cluster, one core replica and two repl (replicant) replicas

Starting yesterday morning, the repl pods began restarting frequently. The following log entries keep recurring:
2026-06-22T13:58:37.108733491+08:00 2026-06-22T05:58:37.108443+00:00 [error] event=connect_to_remote_server, peer=emqx-ev-xxxx@10.32.106.203, port=5369, reason=timeout


2026-06-22T13:58:37.108934617+08:00 2026-06-22T05:58:37.108664+00:00 [error] crasher: initial call: gen_rpc_client:init/1, pid: <0.346398.0>, registered_name: [], exit: {{badrpc,timeout},[{gen_server,init_it,6,[{file,"gen_server.erl"},{line,961}]},{proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,241}]}]}, ancestors: [gen_rpc_client_sup,gen_rpc_sup,<0.2290.0>], message_queue_len: 0, messages: [], links: [<0.2296.0>], dictionary: [], trap_exit: true, status: running, heap_size: 1598, stack_size: 28, reductions: 3832; neighbours:


2026-06-22T15:17:05.208899397+08:00 2026-06-22T07:17:05.208586+00:00 [error] State machine {acceptor,{{10,32,103,227},51026}} terminating. Reason: {badtcp,closed}. Stack: [{gen_statem,loop_state_callback_result,11,[{file,"gen_statem.erl"},{line,1524}]},{proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,241}]}]. Last event: {{call,{<0.2293.0>,#Ref<0.1191865503.3325296642.42693>}},{socket_ready,#Port<0.344147>}}. State: {waiting_for_socket,{state,#Port<0.344147>,tcp,gen_rpc_driver_tcp,tcp_closed,tcp_error,{{10,32,103,227},51026},disabled,disabled}}. Client gen_rpc_server_tcp stacktrace: [{prim_inet,accept0,3,[]},{inet_tcp,accept,2,[{file,"inet_tcp.erl"},{line,227}]},{gen_rpc_server,waiting_for_connection,3,[{file,"gen_rpc_server.erl"},{line,70}]},{gen_statem,loop_state_callback,11,[{file,"gen_statem.erl"},{line,1395}]}].


2026-06-22T15:17:05.209027991+08:00 2026-06-22T07:17:05.208853+00:00 [error] crasher: initial call: gen_rpc_acceptor:init/1, pid: <0.352263.0>, registered_name: [], exit: {{badtcp,closed},[{gen_statem,loop_state_callback_result,11,[{file,"gen_statem.erl"},{line,1524}]},{proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,241}]}]}, ancestors: [gen_rpc_acceptor_sup,gen_rpc_sup,<0.2290.0>], message_queue_len: 0, messages: [], links: [<0.2294.0>], dictionary: [], trap_exit: true, status: running, heap_size: 6772, stack_size: 28, reductions: 10972; neighbours:


2026-06-22T15:17:05.209167984+08:00 2026-06-22T07:17:05.209009+00:00 [error] Supervisor: {local,gen_rpc_acceptor_sup}. Context: child_terminated. Reason: {badtcp,closed}. Offender: id=gen_rpc_acceptor,pid=<0.352263.0>.


2026-06-22T15:17:53.451764538+08:00 2026-06-22T07:17:53.451459+00:00 [error] msg: gen_rpc_client_auth_timeout, error: closed, peer: {{10,32,103,227},59782}

The following logs were printed just before the container exited:
2026-06-23T13:40:37.63292387+08:00 stdout F 2026-06-23T05:40:37.632778+00:00 [error] Supervisor: {local,emqx_sys_sup}. Context: child_terminated. Reason: {timeout,{gen_server,call,[application_controller,which_applications]}}. Offender: id=emqx_os_mon,pid=<0.3873.0>.


2026-06-23T13:40:47.176021546+08:00 stdout F 2026-06-23T05:40:47.174471+00:00 [error] supervisor: {esockd_connection_sup,[<0.345896.0>,<0.347688.0>,<0.911026.0>,<0.911048.0>,<0.389973.0>,<0.271255.0>]}, errorContext: connection_shutdown_error, reason: {shutdown,keepalive_timeout}, offender: [{pid,[<0.345896.0>,<0.347688.0>,<0.911026.0>,<0.911048.0>,<0.389973.0>,<0.271255.0>]},{name,connection},{mfargs,{emqx_connection,start_link,[#{listener => {tcp,default},limiter => #{connection => #{initial => 0,rate => infinity,burst => 0}},zone => default,enable_authn => true}]}}]


2026-06-23T13:40:47.176043626+08:00 stdout F 2026-06-23T05:40:47.174655+00:00 [error] supervisor: {esockd_connection_sup,[<0.914267.0>,<0.914354.0>]}, errorContext: connection_shutdown_error, reason: {shutdown,tcp_closed}, offender: [{pid,[<0.914267.0>,<0.914354.0>]},{name,connection},{mfargs,{emqx_connection,start_link,[#{listener => {tcp,default},limiter => #{connection => #{initial => 0,rate => infinity,burst => 0}},zone => default,enable_authn => true}]}}]


2026-06-23T13:40:47.176046056+08:00 stdout F Listener tcp:default on :30011 stopped.


2026-06-23T13:40:47.177194827+08:00 stdout F Listener tcp:server on :30012 stopped.

The core node also frequently logs the following:
2026-06-23T10:35:46.410097806+08:00 2026-06-23T02:35:46.405759+00:00 [error] lock_owner_status:, [{status,waiting},{message_queue_len,2},{current_stacktrace,[{optvar,read,2,[{file,"optvar.erl"},{line,135}]},{mria,find_upstream_node,1,[{file,"mria.erl"},{line,587}]},{mria,rpc_to_core_node,5,[{file,"mria.erl"},{line,552}]},{emqx_cm,register_channel,3,[{file,"emqx_cm.erl"},{line,190}]},{emqx_cm,'-open_session/4-fun-1-',6,[{file,"emqx_cm.erl"},{line,302}]},{emqx_cm_locker,trans,2,[{file,"emqx_cm_locker.erl"},{line,44}]},{emqx_channel,process_connect,2,[{file,"emqx_channel.erl"},{line,587}]},{emqx_connection,with_channel,3,[{file,"emqx_connection.erl"},{line,811}]},{emqx_connection,process_msg,2,[{file,"emqx_connection.erl"},{line,472}]},{emqx_connection,process_msg,2,[{file,"emqx_connection.erl"},{line,478}]},{emqx_connection,handle_recv,3,[{file,"emqx_connection.erl"},{line,434}]},{proc_lib,wake_up,3,[{file,"proc_lib.erl"},{line,251}]}]}]


2026-06-23T10:35:46.408227997+08:00 2026-06-23T02:35:46.401554+00:00 [error] kill <66063.397944.0> as it has held the lock for too long, resource: <<"xxxxxxx">>

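According to monitoring, until two days ago CPU usage (emqx_vm_cpu_use) stayed at 60-70% and memory usage at 50-60%. Yesterday morning, CPU usage on the two repl replicas alternately rose above 80%, and then the replicas started going down in turn.

As of this writing, I have added one more replica and the errors above no longer appear in the logs. For now it looks like high CPU usage is what brought the pods down.

By the way, the EMQX image ships with too few tools. It lacks things like telnet and curl, and they cannot be installed with sudo apt either.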

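This looks like an intra-cluster RPC plus core-node load problem.
connect_to_remote_server ... port=5369, reason=timeout is an inter-node RPC connection timeout. The later lock_owner_status and kill ... held the lock for too long entries indicate that the session registration/lock path on the core is already stuck; the emqx_os_mon timeout and the listener stopped messages look more like consequences of the node shutting down.

First, check how the repl pod exited and how much resource it was using: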

kubectl -n <ns> describe pod <repl-pod>
kubectl -n <ns> get pod <repl-pod> -o jsonpath='{.status.containerStatuses[*].lastState.terminated.reason}{" "}{.status.containerStatuses[*].lastState.terminated.exitCode}{"\n"}'
kubectl -n <ns> logs <repl-pod> --previous --tail=300
kubectl -n <ns> top pod -l app.kubernetes.io/name=emqx

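Next, check the cluster RPC link on port 5369. It is normal for the EMQX image to ship few tools; do not install packages inside the image, just start a temporary debug pod: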

kubectl -n <ns> run netshoot --rm -it --restart=Never --image=nicolaka/netshoot -- sh
nc -vz <core-pod-ip> 5369
nc -vz <repl-pod-ip> 5369

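Focus on:

  • Whether the pod was OOMKilled, killed by the liveness probe, or EMQX exited on its own.
  • Whether the CPU limit is too low and causes throttling. When the Erlang VM is CPU-throttled, gen_server:call timeouts and RPC timeouts show up easily.
  • With only 1 core, every replicant write path has to go back to the core; if connections are established and closed very frequently, the core becomes the bottleneck first.
    You are currently running 1 core + 2 repl; at this scale there is generally no need for Core/Replicant. Either use 3 core nodes as a normal small cluster, or first raise the core's resources and confirm that pod-to-pod traffic on 5369 is stable. Post the describe pod output above, the resource requests/limits, and the last 300 lines from --previous; only then can we tell whether this is resource exhaustion, probes killing the pod by mistake, or cluster network jitter.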
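
For the CPU throttling point above, one way to check is to read the container's cgroup cpu.stat and compute the share of throttled scheduling periods. This is only a minimal sketch: throttle_pct is a hypothetical helper name, not an EMQX or kubectl command, and it assumes the standard nr_periods and nr_throttled fields are present in cpu.stat.

```shell
# Hypothetical helper: reads cpu.stat content on stdin and prints the
# percentage of CFS periods in which the cgroup was throttled.
# Prints "n/a" when no periods were counted (for example, no CPU limit set).
throttle_pct() {
  awk '
    $1 == "nr_periods"   { periods = $2 }
    $1 == "nr_throttled" { throttled = $2 }
    END {
      if (periods > 0) printf "%.1f\n", 100 * throttled / periods
      else             print "n/a"
    }'
}
```

Usage, assuming cgroup v2 (on cgroup v1 the file usually lives under /sys/fs/cgroup/cpu/): kubectl -n <ns> exec <repl-pod> -- cat /sys/fs/cgroup/cpu.stat | throttle_pct. The awk part runs on your own machine, so nothing has to be installed in the EMQX image.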

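The connection count is currently around 520,000. When we designed the architecture, we deliberately planned for the core node not to accept device connections directly. I asked other colleagues: CPU usage on the two repl pods had previously been between 70% and 80% and nobody dealt with it at the time, so the usage peak two days ago was enough to bring the pods down. After scaling out, the problem has not recurred.

I checked the state of the last terminated container. The events currently show no liveness probe failed entries, but earlier, while the container was restarting frequently, liveness probe failed was being reported.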

Last State:     Terminated
  Reason:       Error
  Exit Code:    137
  Started:      Tue, 23 Jun 2026 14:19:56 +0800
  Finished:     Tue, 23 Jun 2026 14:51:25 +0800

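Using kubectl, I looked at the logs from before the previous container terminated. There is a line Received terminate signal, shutting down now, and 43 seconds passed between that log line and the container's Finished time.

I started a debug pod, and it can connect to port 5369 on the other pods.

No resource limits are set, and peak CPU and memory on the host were not saturated.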

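This looks like it should be handled as probe-triggered restarts caused by overload, not as a problem with the 5369 network itself.

nc can reach 5369 and the errors disappeared after scaling out, which generally means the RPC timeouts were caused by node load.
Received terminate signal means the container first received SIGTERM from the kubelet; the final Exit Code: 137 is SIGKILL (128 + 9), which usually means the kubelet force-killed the container, either after consecutive liveness failures or because graceful shutdown did not finish within the grace period.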
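
The exit code arithmetic can be checked in any POSIX shell. The helper below is a small sketch (exit_signal is a hypothetical name) that relies on the convention that a process killed by a signal reports 128 plus the signal number.

```shell
# Hypothetical helper: map a container exit code to the signal that killed it.
# Exit codes above 128 mean "terminated by signal (code - 128)".
exit_signal() {
  code=$1
  if [ "$code" -gt 128 ]; then
    # kill -l <number> prints the signal name without the SIG prefix
    kill -l "$((code - 128))"
  else
    echo "none"
  fi
}

# Example: exit_signal 137   (137 - 128 = 9, i.e. SIGKILL)
```

To see how long the kubelet waits between SIGTERM and SIGKILL for this pod, the configured grace period can be read with kubectl -n <ns> get pod <repl-pod> -o jsonpath='{.spec.terminationGracePeriodSeconds}{"\n"}'.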

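With 520,000 connections and the 2 repl nodes at 70%-80% CPU, a burst of connection flapping or reconnects at peak will saturate the replicants that accept the connections first; the single core also has to handle the write path for sessions, routes, and locks, so lock_owner_status and RPC timeouts show up together.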

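  1. Keep the scaled-out number of repl nodes, and size for peak CPU below 60%-70% so there is headroom; do not let day-to-day load run at 80%.
  2. Do not make liveness too aggressive; prefer readiness for taking a pod out of traffic. Raise the liveness failureThreshold / timeoutSeconds / periodSeconds somewhat so that a brief stall at peak does not kill the pod outright.
  3. Lengthen terminationGracePeriodSeconds enough for EMQX to finish a graceful shutdown; otherwise you will see SIGTERM followed by a final 137.
  4. If you stay on Core/Replicant, at least 3 core + N repl is recommended; 1 core is a single point of failure, and its write path is also easily saturated when connection churn is high.
  5. Watch emqx_mria_server_mql on the core and emqx_mria_lag / emqx_mria_message_queue_len on the replicants. If these metrics keep climbing during peak hours, the core/replication link is short on capacity.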
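
For item 5, if no Prometheus scraper is set up yet, the metrics can be spot-checked by hand from the netshoot debug pod, which includes curl. This is a sketch under assumptions: mria_metrics is a hypothetical helper name, and the usage line assumes the default dashboard/API port 18083 and that /api/v5/prometheus/stats is reachable without authentication in your deployment.

```shell
# Hypothetical helper: keep only the three Mria metrics named in item 5
# from Prometheus text-format output read on stdin.
# The ^ anchor drops the "# HELP" / "# TYPE" comment lines.
mria_metrics() {
  grep -E '^emqx_mria_(server_mql|lag|message_queue_len)'
}
```

Usage from inside the debug pod: curl -s http://<pod-ip>:18083/api/v5/prometheus/stats | mria_metrics. Run it against the core pod for emqx_mria_server_mql and against each repl pod for the other two, and compare the values between quiet and peak hours.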