How to configure the maximum memory

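Open-source version 5.8.7 on Alibaba Cloud Linux 3. A Java application subscribes to 50 topics, each of which publishes once per second. After another 20 topics were added, the Java program froze. If I restart EMQX and then restart the Java program, it receives messages for a few seconds and then freezes again; it is EMQX that has died. The most recent investigation found that it requests 28 GB of memory right at startup, so I would like to configure a memory limit. At first the diagnosis was that insufficient consumer throughput caused the freeze. I then made the Java client ID random, and all messages are QoS 0, but even with the database-save logic commented out it still freezes after a few seconds. Below are the logs from the most recent investigations: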
2026-06-22T13:17:40.805265+08:00 [error] Generic server memsup terminating. Reason: {port_died,normal}. Last message: {‘EXIT’,<0.2445.0>,{port_died,normal}}. State: [{data,[{“Timeout”,60000}]},{items,{“Memory Usage”,[{“Allocated”,31424880640},{“Total”,32842895360}]}},{items,{“Worst Memory User”,[{“Pid”,<0.2029.0>},{“Memory”,230432}]}}].
2026-06-22T13:17:40.805684+08:00 [critical] msg: received_terminate_signal
2026-06-22T13:17:40.820720+08:00 [error] crasher: initial call: memsup:init/1, pid: <0.2444.0>, registered_name: memsup, exit: {{port_died,normal},[{gen_server,handle_common_reply,8,[{file,“gen_server.erl”},{line,1226}]},{proc_lib,init_p_do_apply,3,[{file,“proc_lib.erl”},{line,241}]}]}, ancestors: [os_mon_sup,<0.2442.0>], message_queue_len: 0, messages: , links: [<0.2443.0>], dictionary: , trap_exit: true, status: running, heap_size: 2586, stack_size: 28, reductions: 1608638; neighbours:
2026-06-22T13:17:40.835430+08:00 [error] Supervisor: {local,os_mon_sup}. Context: shutdown_error. Reason: {port_died,normal}. Offender: id=memsup,pid=<0.2444.0>.
2026-06-22T13:17:40.859410+08:00 [warning] OS_MON (cpu_sup) called by <0.2621.0>, not started
2026-06-22T13:19:25.003251+08:00 [warning] clientid: gansu-mqtt-client1c5ef7dc0-d27b-41c8-9e53-00beb1c59a82, msg: socket_error, peername: 111.92.2.220:37810, reason: timeout
2026-06-22T13:21:44.603795+08:00 [warning] msg: alarm_is_activated, message: <<“connection congested: #{memory => 42600,message_queue_len => 0,pid => <<"<0.6122.0>">>,reductions => 2690704,send_pend => 909900,peername => <<"111.92.2.220:43238">>,sockname => <<"172.29.100.162:1883">>,buffer => 4096,high_msgq_watermark => 8192,high_watermark => 1048576,recbuf => 131072,sndbuf => 243712,recv_cnt => 2,recv_oct => 92,send_cnt => 6812,send_oct => 2583686,username => undefined”…>>, name: <<“conn_congestion/gansu-mqtt-client1d064d479-cc7d-4957-901e-dedd7da786f3/undefined”>>
2026-06-22T13:22:00.187986+08:00 [warning] clientid: gansu-mqtt-client1d064d479-cc7d-4957-901e-dedd7da786f3, msg: socket_error, peername: 111.92.2.220:43238, reason: timeout
2026-06-22T13:22:00.188516+08:00 [warning] msg: alarm_is_deactivated, name: <<“conn_congestion/gansu-mqtt-client1d064d479-cc7d-4957-901e-dedd7da786f3/undefined”>>
2026-06-22T13:29:59.450014+08:00 [critical] msg: received_terminate_signal
2026-06-22T13:29:59.465440+08:00 [error] Generic server memsup terminating. Reason: {port_died,normal}. Last message: {‘EXIT’,<0.2443.0>,{port_died,normal}}. State: [{data,[{“Timeout”,60000}]},{items,{“Memory Usage”,[{“Allocated”,31215046656},{“Total”,32842895360}]}},{items,{“Worst Memory User”,[{“Pid”,<0.2846.0>},{“Memory”,4720664}]}}].
2026-06-22T13:29:59.467814+08:00 [error] crasher: initial call: memsup:init/1, pid: <0.2442.0>, registered_name: memsup, exit: {{port_died,normal},[{gen_server,handle_common_reply,8,[{file,“gen_server.erl”},{line,1226}]},{proc_lib,init_p_do_apply,3,[{file,“proc_lib.erl”},{line,241}]}]}, ancestors: [os_mon_sup,<0.2440.0>], message_queue_len: 0, messages: , links: [<0.2441.0>], dictionary: , trap_exit: true, status: running, heap_size: 4185, stack_size: 28, reductions: 568323; neighbours:
2026-06-22T13:29:59.468162+08:00 [error] Supervisor: {local,os_mon_sup}. Context: shutdown_error. Reason: {port_died,normal}. Offender: id=memsup,pid=<0.2442.0>.
2026-06-22T13:29:59.497132+08:00 [warning] OS_MON (cpu_sup) called by <0.2619.0>, not started
2026-06-22T13:30:18.440136+08:00 [warning] tag: AUTHN, clientid: ipcs_2_data, msg: authentication_failure, peername: 112.25.55.8:50894, username: zhongzi, reason: not_authorized
2026-06-22T13:30:18.446448+08:00 [warning] clientid: lthscgj_ipcs_reset1782105944, msg: authorization_not_initialized, peername: 112.25.55.8:57005, username: meiyou, topic: LTHGJ(L)-jiasudu/snapshot/D/blob, source: emqx_authz
2026-06-22T13:30:18.447416+08:00 [warning] tag: AUTHZ, clientid: lthscgj_ipcs_reset1782105944, msg: cannot_publish_to_topic_due_to_not_authorized, peername: 112.25.55.8:57005, username: meiyou, topic: LTHGJ(L)-jiasudu/snapshot/D/blob, reason: not_authorized
2026-06-22T13:31:40.177801+08:00 [critical] msg: received_terminate_signal
2026-06-22T13:31:40.179013+08:00 [warning] msg: long_schedule, info: [{timeout,63912},{in,{erpc,mcall_receive_replies,5}},{out,{erlang,port_command,3}}], procinfo: [{pid,<0.19347.0>}]
2026-06-22T13:31:40.178803+08:00 [warning] msg: log_events_throttled_during_last_period, dropped: #{cannot_publish_to_topic_due_to_not_authorized => 17}, period: 1 minutes, 0 seconds
2026-06-22T13:31:40.180064+08:00 [error] Generic server memsup terminating. Reason: {port_died,normal}. Last message: {‘EXIT’,<0.2443.0>,{port_died,normal}}. State: [{data,[{“Timeout”,60000}]},{items,{“Memory Usage”,[{“Allocated”,30277451776},{“Total”,32842895360}]}},{items,{“Worst Memory User”,[{“Pid”,<0.2029.0>},{“Memory”,4720520}]}}].
2026-06-22T13:31:40.184568+08:00 [error] crasher: initial call: memsup:init/1, pid: <0.2442.0>, registered_name: memsup, exit: {{port_died,normal},[{gen_server,handle_common_reply,8,[{file,“gen_server.erl”},{line,1226}]},{proc_lib,init_p_do_apply,3,[{file,“proc_lib.erl”},{line,241}]}]}, ancestors: [os_mon_sup,<0.2440.0>], message_queue_len: 3, messages: [time_to_collect,{‘$gen_call’,{<0.2619.0>,#Ref<0.502469574.1123024900.48364>},get_system_memory_data},{‘EXIT’,<0.2441.0>,shutdown}], links: [<0.2441.0>], dictionary: [{system_memory_high_watermark,set}], trap_exit: true, status: running, heap_size: 4185, stack_size: 28, reductions: 91932; neighbours:
2026-06-22T13:31:40.184969+08:00 [error] Supervisor: {local,os_mon_sup}. Context: shutdown_error. Reason: {port_died,normal}. Offender: id=memsup,pid=<0.2442.0>.
2026-06-22T13:31:40.201212+08:00 [warning] OS_MON (cpu_sup) called by <0.2619.0>, not started
2026-06-22T13:32:35.534987+08:00 [warning] msg: alarm_is_activated, message: <<“connection congested: #{memory => 16960,message_queue_len => 0,pid => <<"<0.12044.0>">>,reductions => 27249,send_pend => 47066,peername => <<"182.92.212.220:56548">>,sockname => <<"172.29.100.162:1883">>,buffer => 4096,high_msgq_watermark => 8192,high_watermark => 1048576,recbuf => 131072,sndbuf => 87040,recv_cnt => 2,recv_oct => 92,send_cnt => 62,send_oct => 298357,username => undefined,clien”…>>, name: <<“conn_congestion/gansu-mqtt-client12e7ec21a-4cec-4dcf-a87a-716989146891/undefined”>>
2026-06-22T13:39:49.993224+08:00 [critical] msg: received_terminate_signal
2026-06-22T13:39:49.993024+08:00 [error] Generic server memsup terminating. Reason: {port_died,normal}. Last message: {‘EXIT’,<0.2439.0>,{port_died,normal}}. State: [{data,[{“Timeout”,60000}]},{items,{“Memory Usage”,[{“Allocated”,30274830336},{“Total”,32842895360}]}},{items,{“Worst Memory User”,[{“Pid”,<0.2029.0>},{“Memory”,4720520}]}}].
2026-06-22T13:39:49.993619+08:00 [error] crasher: initial call: memsup:init/1, pid: <0.2438.0>, registered_name: memsup, exit: {{port_died,normal},[{gen_server,handle_common_reply,8,[{file,“gen_server.erl”},{line,1226}]},{proc_lib,init_p_do_apply,3,[{file,“proc_lib.erl”},{line,241}]}]}, ancestors: [os_mon_sup,<0.2436.0>], message_queue_len: 1, messages: [{‘$gen_call’,{<0.2615.0>,#Ref<0.194947932.317718530.201308>},get_system_memory_data}], links: [<0.2437.0>], dictionary: [{system_memory_high_watermark,set}], trap_exit: true, status: running, heap_size: 4185, stack_size: 28, reductions: 94440; neighbours:
2026-06-22T13:39:50.003240+08:00 [error] Error in process <0.2442.0> on node ‘emqx@127.0.0.1’ with exit value:, {badarg,[{erlang,port_command,[#Port<0.7>,“u”],[{error_info,#{module => erl_erts_errors}}]},{cpu_sup,port_server_loop,2,[{file,“cpu_sup.erl”},{line,603}]}]}
2026-06-22T13:39:50.000291+08:00 [error] Supervisor: {local,os_mon_sup}. Context: child_terminated. Reason: {port_died,normal}. Offender: id=memsup,pid=<0.2438.0>.
2026-06-22T13:39:50.054500+08:00 [warning] clientid: gansu-mqtt-client12e7ec21a-4cec-4dcf-a87a-716989146891, msg: socket_error, peername: 111.92.2.220:56548, reason: timeout
2026-06-22T13:39:50.056639+08:00 [warning] msg: alarm_is_deactivated, name: <<“conn_congestion/gansu-mqtt-client12e7ec21a-4cec-4dcf-a87a-716989146891/undefined”>>
2026-06-22T13:39:52.005446+08:00 [error] Supervisor: {local,os_mon_sup}. Context: shutdown_error. Reason: killed. Offender: id=cpu_sup,pid=<0.2440.0>.
2026-06-22T13:39:52.005476+08:00 [error] Generic server emqx_os_mon terminating. Reason: {{killed,{gen_server,call,[cpu_sup,{“u”,false,false},infinity]}},[{gen_server,call,3,[{file,“gen_server.erl”},{line,419}]},{os_mon,call,3,[{file,“os_mon.erl”},{line,42}]},{cpu_sup,util,0,[{file,“cpu_sup.erl”},{line,128}]},{emqx_os_mon,handle_info,2,[{file,“emqx_os_mon.erl”},{line,139}]},{gen_server,try_handle_info,3,[{file,“gen_server.erl”},{line,1095}]},{gen_server,handle_msg,6,[{file,“gen_server.erl”},{line,1183}]},{proc_lib,init_p_do_apply,3,[{file,“proc_lib.erl”},{line,241}]}]}. Last message: {timeout,#Ref<0.194947932.317718530.148260>,cpu_check}. State: #{cpu_time_ref => #Ref<0.194947932.317718530.148260>,mem_time_ref => #Ref<0.194947932.317718532.165155>,sysmem_high_watermark => 0.7}.
2026-06-22T13:39:52.005772+08:00 [error] crasher: initial call: emqx_os_mon:init/1, pid: <0.2615.0>, registered_name: emqx_os_mon, exit: {{killed,{gen_server,call,[cpu_sup,{“u”,false,false},infinity]}},[{gen_server,call,3,[{file,“gen_server.erl”},{line,419}]},{os_mon,call,3,[{file,“os_mon.erl”},{line,42}]},{cpu_sup,util,0,[{file,“cpu_sup.erl”},{line,128}]},{emqx_os_mon,handle_info,2,[{file,“emqx_os_mon.erl”},{line,139}]},{gen_server,try_handle_info,3,[{file,“gen_server.erl”},{line,1095}]},{gen_server,handle_msg,6,[{file,“gen_server.erl”},{line,1183}]},{proc_lib,init_p_do_apply,3,[{file,“proc_lib.erl”},{line,241}]}]}, ancestors: [emqx_sys_sup,emqx_sup,<0.2453.0>], message_queue_len: 0, messages: , links: [<0.2607.0>], dictionary: , trap_exit: false, status: running, heap_size: 6772, stack_size: 28, reductions: 11931; neighbours:
2026-06-22T13:39:52.006391+08:00 [error] Supervisor: {local,emqx_sys_sup}. Context: child_terminated. Reason: {killed,{gen_server,call,[cpu_sup,{“u”,false,false},infinity]}}. Offender: id=emqx_os_mon,pid=<0.2615.0>.
2026-06-22T14:56:22.436291+08:00 [warning] msg: config_key_not_recognized, unknown_config_keys: os_mon,session

The logs look more like system memory being nearly exhausted, combined with connection congestion on one subscribing client.

This part of the memsup state:

Allocated 31424880640
Total 32842895360

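is the system-wide allocated memory as seen by os_mon; it does not mean that EMQX itself took 28 GB at startup. The received_terminate_signal that follows usually means the process received a stop signal, so first check whether systemd, a container runtime, or the OOM killer stopped EMQX.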

emqx ctl vm memory
emqx ctl clients show gansu-mqtt-client1d064d479-cc7d-4957-901e-dedd7da786f3
free -h
ps -o pid,rss,vsz,cmd -C beam.smp
journalctl -u emqx --since "2026-06-22 13:10:00" --until "2026-06-22 13:35:00"
dmesg -T | grep -i -E "oom|killed process|beam.smp|emqx"
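
As a quick sanity check on the memsup figures quoted above, the sketch below computes what fraction of Total the Allocated value represents. The class and method names are made up for illustration and are not part of EMQX or OTP; the two numbers are copied from the 13:17:40 log line.

```java
// Hypothetical helper for illustration; not part of EMQX or OTP.
final class MemsupRatio {
    // Returns Allocated as a percentage of Total (both in bytes).
    static double allocatedPercent(long allocatedBytes, long totalBytes) {
        if (totalBytes <= 0) {
            throw new IllegalArgumentException("totalBytes must be positive");
        }
        return 100.0 * allocatedBytes / totalBytes;
    }

    // Converts bytes to GiB (1 GiB = 2^30 bytes).
    static double toGiB(long bytes) {
        return bytes / (1024.0 * 1024.0 * 1024.0);
    }

    public static void main(String[] args) {
        long allocated = 31424880640L; // "Allocated" from the memsup state
        long total = 32842895360L;     // "Total" from the memsup state
        System.out.printf("allocated: %.1f GiB of %.1f GiB (%.1f%%)%n",
                toGiB(allocated), toGiB(total), allocatedPercent(allocated, total));
    }
}
```

With these inputs the ratio comes out at roughly 96%, well above the sysmem_high_watermark => 0.7 visible in the emqx_os_mon state in the logs. That is a system-wide figure, so it says nothing yet about how much of it belongs to EMQX.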

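What the EMQX configuration lets you tune is not a "maximum process memory" but alarm/protection thresholds and per-client process limits, for example: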

sysmon {
  os {
    sysmem_high_watermark = 70%
    procmem_high_watermark = 5%
  }
}
force_shutdown {
  max_mailbox_size = 1000
  max_heap_size = 32MB
}

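sysmem_high_watermark only triggers the high-memory alarm and the related overload protection; it does not cap EMQX's RSS. force_shutdown.max_heap_size kicks a single client connection process once its heap exceeds the threshold; it is not a maximum for EMQX as a whole either.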

If you really do want a hard limit on EMQX's total memory, set it at the deployment layer:

# systemd package install
sudo systemctl edit emqx
# add:
[Service]
MemoryMax=8G

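For a Docker deployment, add --memory=8g --memory-swap=8g to the container. This is a hard cap: once it is exceeded the process may be OOM-killed outright, and it will not fix the freeze.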

connection congested ... send_pend => 909900 ... reason: timeout

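This means that data the broker sends to this Java subscriber is piling up on the socket's send side. 70 topics at one message per second each should not normally fill 32 GB, unless the payloads are very large, the client's network or callback thread is blocked, or offline sessions are accumulating. First confirm cleanStart=true and sessionExpiryInterval=0, then post the output of the commands above together with a Java thread dump.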
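
One rough way to judge whether payload size is a factor is to divide send_oct by send_cnt from the congestion alarm, which approximates the average bytes per send on that connection. The sketch below does this for the two alarms in the logs. The helper names are hypothetical, and because these are socket-level counters that cover everything sent on the connection, the result is only an estimate.

```java
// Hypothetical helper for illustration; the inputs are copied from the
// two "connection congested" alarms in the logs above.
final class CongestionStats {
    // Average bytes per send: send_oct / send_cnt.
    static double avgBytesPerSend(long sendOct, long sendCnt) {
        if (sendCnt <= 0) {
            throw new IllegalArgumentException("sendCnt must be positive");
        }
        return (double) sendOct / sendCnt;
    }

    // Rough number of average-sized sends still waiting in send_pend.
    static double pendingSends(long sendPend, double avgBytes) {
        return sendPend / avgBytes;
    }

    public static void main(String[] args) {
        // 13:21:44 alarm: send_cnt => 6812, send_oct => 2583686, send_pend => 909900
        double avg1 = avgBytesPerSend(2583686L, 6812L);
        // 13:32:35 alarm: send_cnt => 62, send_oct => 298357, send_pend => 47066
        double avg2 = avgBytesPerSend(298357L, 62L);
        System.out.printf("alarm 1: avg %.0f B/send, ~%.0f sends pending%n",
                avg1, pendingSends(909900L, avg1));
        System.out.printf("alarm 2: avg %.0f B/send, ~%.0f sends pending%n",
                avg2, pendingSends(47066L, avg2));
    }
}
```

If the averages stay in the range of a few hundred bytes to a few kilobytes, payload size alone is an unlikely explanation for the backlog, which points back at the client's network path or callback thread.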

[root@iZ2zej3rz8yuqcwutp3k58Z ~]# emqx ctl vm memory
memory/total : 152991552
memory/processes : 42087776
memory/processes_used : 41992992
memory/system : 110903776
memory/atom : 2195873
memory/atom_used : 2189408
memory/binary : 991136
memory/code : 51237278
memory/ets : 12690512
[root@iZ2zej3rz8yuqcwutp3k58Z ~]# free -h
              total        used        free      shared  buff/cache   available
Mem:           30Gi       5.0Gi       2.3Gi       227Mi        23Gi        24Gi
Swap:            0B          0B          0B
[root@iZ2zej3rz8yuqcwutp3k58Z ~]# ps -o pid,rss,vsz,cmd -C beam.smp
PID RSS VSZ CMD
696533 262284 3322192 emqx -Bd -spp true -A 4 -IOt 4 -SDio 8 -C multi_time_warp -c true -pc unicode -e 262144 -zdbbl 8192 -Q 1048576 -P 2097152 -- -root /usr/lib/emqx -bindir /usr/lib/
[root@iZ2zej3rz8yuqcwutp3k58Z ~]# journalctl -u emqx --since "2026-06-22 13:10:00" --until "2026-06-22 13:35:00"
-- Logs begin at Thu 2025-12-04 11:08:39 CST, end at Mon 2026-06-22 16:16:13 CST. --
6月 22 13:17:11 iZ2zej3rz8yuqcwutp3k58Z systemd[1]: Stopping emqx daemon…
6月 22 13:17:26 iZ2zej3rz8yuqcwutp3k58Z bash[3652030]: Node ‘emqx@127.0.0.1’ not responding to pings.
6月 22 13:17:40 iZ2zej3rz8yuqcwutp3k58Z bash[3653765]: Node ‘emqx@127.0.0.1’ not responding to pings.
6月 22 13:17:40 iZ2zej3rz8yuqcwutp3k58Z bash[3651947]: ERROR: Graceful shutdown failed PID=
6月 22 13:17:40 iZ2zej3rz8yuqcwutp3k58Z bash[1840123]: 2026-06-22T13:17:40+08:00 received_terminate_signal, shutting down now.
6月 22 13:17:41 iZ2zej3rz8yuqcwutp3k58Z bash[1840123]: Stop listener http:dashboard on :18083 successfully.
6月 22 13:17:41 iZ2zej3rz8yuqcwutp3k58Z bash[1840123]: Listener tcp:default on 0.0.0.0:1883 stopped.
6月 22 13:17:41 iZ2zej3rz8yuqcwutp3k58Z bash[1840123]: Listener ssl:default on 0.0.0.0:8883 stopped.
6月 22 13:17:41 iZ2zej3rz8yuqcwutp3k58Z bash[1840123]: Listener ws:default on 0.0.0.0:8083 stopped.
6月 22 13:17:41 iZ2zej3rz8yuqcwutp3k58Z bash[1840123]: Listener wss:default on 0.0.0.0:8084 stopped.
6月 22 13:17:42 iZ2zej3rz8yuqcwutp3k58Z systemd[1]: emqx.service: Succeeded.
6月 22 13:17:42 iZ2zej3rz8yuqcwutp3k58Z systemd[1]: Stopped emqx daemon.
6月 22 13:17:49 iZ2zej3rz8yuqcwutp3k58Z systemd[1]: Started emqx daemon.
6月 22 13:17:50 iZ2zej3rz8yuqcwutp3k58Z bash[3656519]: WARNING: Default (insecure) Erlang cookie is in use.
6月 22 13:17:50 iZ2zej3rz8yuqcwutp3k58Z bash[3656519]: WARNING: Configure node.cookie in /etc/emqx/emqx.conf or override from environment variable EMQX_NODE__COOKIE
6月 22 13:17:50 iZ2zej3rz8yuqcwutp3k58Z bash[3656519]: WARNING: NOTE: Use the same cookie for all nodes in the cluster.
6月 22 13:17:53 iZ2zej3rz8yuqcwutp3k58Z bash[3656519]: Listener tcp:default on 0.0.0.0:1883 started.
6月 22 13:17:53 iZ2zej3rz8yuqcwutp3k58Z bash[3656519]: Listener ssl:default on 0.0.0.0:8883 started.
6月 22 13:17:53 iZ2zej3rz8yuqcwutp3k58Z bash[3656519]: Listener ws:default on 0.0.0.0:8083 started.
6月 22 13:17:53 iZ2zej3rz8yuqcwutp3k58Z bash[3656519]: Listener wss:default on 0.0.0.0:8084 started.
6月 22 13:17:53 iZ2zej3rz8yuqcwutp3k58Z bash[3656519]: Listener http:dashboard on :18083 started.
6月 22 13:17:53 iZ2zej3rz8yuqcwutp3k58Z bash[3656519]: EMQX 5.8.7 is running now!
6月 22 13:20:33 iZ2zej3rz8yuqcwutp3k58Z systemd[1]: Stopping emqx daemon…
6月 22 13:20:34 iZ2zej3rz8yuqcwutp3k58Z bash[3656519]: Stop listener http:dashboard on :18083 successfully.
6月 22 13:20:34 iZ2zej3rz8yuqcwutp3k58Z bash[3656519]: Listener tcp:default on 0.0.0.0:1883 stopped.
[root@iZ2zej3rz8yuqcwutp3k58Z ~]# dmesg -T | grep -i -E "oom|killed process|beam.smp|emqx"
[一 6月 22 09:59:26 2026] task:oom_reaper state:S stack: 0 pid: 38 ppid: 2 flags:0x00004000
[一 6月 22 09:59:26 2026] ? oom_reap_task_mm+0x160/0x160
[一 6月 22 09:59:26 2026] oom_reaper+0x126/0x190
[一 6月 22 09:59:26 2026] ? oom_reap_task_mm+0x160/0x160
[一 6月 22 09:59:26 2026] ? uart_write_room+0x57/0xd0
[一 6月 22 09:59:26 2026] task:beam.smp state:T stack: 0 pid:1840123 ppid: 1 flags:0x00004000
[一 6月 22 09:59:26 2026] rt_rq[0]:/system.slice/emqx.service
[一 6月 22 09:59:26 2026] T erts_ssig_disp 1840678 10277.079649 3 120 0.000000 0.085597 0.000000 0 0 /system.slice/emqx.service
[一 6月 22 09:59:26 2026] T erts_dcpus_1 1840689 10277.030007 490 120 0.000000 164.063419 0.000000 0 0 /system.slice/emqx.service
[一 6月 22 09:59:26 2026] T erts_dcpus_2 1840690 10277.024461 103 120 0.000000 0.303666 0.000000 0 0 /system.slice/emqx.service
[一 6月 22 09:59:26 2026] T erts_dcpus_4 1840692 10277.024694 114 120 0.000000 3.740734 0.000000 0 0 /system.slice/emqx.service
[一 6月 22 09:59:26 2026] T erts_dios_4 1840696 10277.024393 14534 120 0.000000 44.001095 0.000000 0 0 /system.slice/emqx.service
[一 6月 22 09:59:26 2026] T erts_dios_7 1840699 10277.024209 12163 120 0.000000 45.613064 0.000000 0 0 /system.slice/emqx.service
[一 6月 22 09:59:26 2026] T erts_sched_3 1840891 10277.025032 3 120 0.000000 0.014022 0.000000 0 0 /system.slice/emqx.service
[一 6月 22 09:59:26 2026] S cpu_sup 1840965 9512.076519 23 120 0.000000 2.991841 0.000000 0 0 /system.slice/emqx.service
[一 6月 22 09:59:26 2026] S inet_gethost 1842045 1783.613323 3 120 0.000000 0.980968 0.000000 0 0 /system.slice/emqx.service
[一 6月 22 09:59:26 2026] rt_rq[1]:/system.slice/emqx.service
[一 6月 22 09:59:26 2026] T erts_async_2 1840681 9659.096719 2 120 0.000000 0.091132 0.000000 0 0 /system.slice/emqx.service
[一 6月 22 09:59:26 2026] T erts_sched_2 1840686 9666.155197 62846 120 0.000000 6418.791516 0.000000 0 0 /system.slice/emqx.service
[一 6月 22 09:59:26 2026] T erts_dios_1 1840693 9659.051147 13537 120 0.000000 42.305402 0.000000 0 0 /system.slice/emqx.service
[一 6月 22 09:59:26 2026] T erts_dios_3 1840695 9659.046596 13374 120 0.000000 45.598843 0.000000 0 0 /system.slice/emqx.service
[一 6月 22 09:59:26 2026] T erts_dios_5 1840697 9659.045865 14926 120 0.000000 44.917431 0.000000 0 0 /system.slice/emqx.service
[一 6月 22 09:59:26 2026] T erts_dios_6 1840698 9659.045580 13540 120 0.000000 46.867829 0.000000 0 0 /system.slice/emqx.service
[一 6月 22 09:59:26 2026] T erts_sched_3 1840889 9659.046363 3 120 0.000000 0.019094 0.000000 0 0 /system.slice/emqx.service
[一 6月 22 09:59:26 2026] T erts_sched_3 1840892 9659.045869 3 120 0.000000 0.011753 0.000000 0 0 /system.slice/emqx.service
[一 6月 22 09:59:26 2026] rt_rq[2]:/system.slice/emqx.service
[一 6月 22 09:59:26 2026] S oom_reaper 38 62.628747 2 120 0.000000 0.000000 0.000000 0 0 /
[一 6月 22 09:59:26 2026] T beam.smp 1840123 12461.659359 164 120 0.000000 56.590658 0.000000 0 0 /system.slice/emqx.service
[一 6月 22 09:59:26 2026] T erts_async_1 1840680 12461.602249 2 120 0.000000 0.050913 0.000000 0 0 /system.slice/emqx.service
[一 6月 22 09:59:26 2026] T erts_async_3 1840682 12461.597701 2 120 0.000000 0.020725 0.000000 0 0 /system.slice/emqx.service
[一 6月 22 09:59:26 2026] T erts_async_4 1840683 12461.597537 2 120 0.000000 0.018710 0.000000 0 0 /system.slice/emqx.service
[一 6月 22 09:59:26 2026] T erts_sched_3 1840687 12469.620193 43381 120 0.000000 4188.743926 0.000000 0 0 /system.slice/emqx.service
[一 6月 22 09:59:26 2026] T erts_sched_4 1840688 12470.597901 32015 120 0.000000 1933.535120 0.000000 0 0 /system.slice/emqx.service
[一 6月 22 09:59:26 2026] T erts_dios_2 1840694 12461.597249 17023 120 0.000000 141.544764 0.000000 0 0 /system.slice/emqx.service
[一 6月 22 09:59:26 2026] T erts_aux_1 1840701 12461.602349 11924 120 0.000000 184.256803 0.000000 0 0 /system.slice/emqx.service
[一 6月 22 09:59:26 2026] T erts_poll_2 1840704 12461.631642 10737 120 0.000000 144.216895 0.000000 0 0 /system.slice/emqx.service
[一 6月 22 09:59:26 2026] T bcrypt_worker 1840963 12461.597332 2 120 0.000000 0.012352 0.000000 0 0 /system.slice/emqx.service
[一 6月 22 09:59:26 2026] S erl_child_setup 1840684 2108.146810 7 120 0.000000 1.471677 0.000000 0 0 /system.slice/emqx.service
[一 6月 22 09:59:26 2026] S inet_gethost 1842046 2127.230308 3 120 0.000000 0.285562 0.000000 0 0 /system.slice/emqx.service
[一 6月 22 09:59:26 2026] rt_rq[3]:/system.slice/emqx.service
[一 6月 22 09:59:26 2026] T erts_smsg_disp 1840679 10517.756935 3 120 0.000000 0.094178 0.000000 0 0 /system.slice/emqx.service
[一 6月 22 09:59:26 2026] T erts_sched_1 1840685 10526.702953 268557 120 0.000000 28996.112140 0.000000 0 0 /system.slice/emqx.service
[一 6月 22 09:59:26 2026] T erts_dcpus_3 1840691 10517.699239 94 120 0.000000 0.239316 0.000000 0 0 /system.slice/emqx.service
[一 6月 22 09:59:26 2026] T erts_dios_8 1840700 10517.700135 17708 120 0.000000 616.816935 0.000000 0 0 /system.slice/emqx.service
[一 6月 22 09:59:26 2026] T erts_poll_0 1840702 10517.700584 6531 120 0.000000 84.795916 0.000000 0 0 /system.slice/emqx.service
[一 6月 22 09:59:26 2026] T erts_poll_1 1840703 10517.700193 6354 120 0.000000 87.542885 0.000000 0 0 /system.slice/emqx.service
[一 6月 22 09:59:26 2026] T erts_poll_3 1840705 10517.700567 9669 120 0.000000 129.713960 0.000000 0 0 /system.slice/emqx.service
[一 6月 22 09:59:26 2026] T erts_sched_3 1840890 10517.700716 3 120 0.000000 0.016991 0.000000 0 0 /system.slice/emqx.service
[一 6月 22 09:59:26 2026] T bcrypt_worker 1840960 10517.699915 2 120 0.000000 0.013399 0.000000 0 0 /system.slice/emqx.service
[一 6月 22 09:59:26 2026] T bcrypt_worker 1840961 10517.700168 2 120 0.000000 0.007547 0.000000 0 0 /system.slice/emqx.service
[一 6月 22 09:59:26 2026] T bcrypt_worker 1840962 10517.699878 2 120 0.000000 0.006880 0.000000 0 0 /system.slice/emqx.service
[一 6月 22 09:59:26 2026] S memsup 1840964 10514.208310 1064 120 0.000000 84.068318 0.000000 0 0 /system.slice/emqx.service
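At first emqx ctl vm memory did not respond, even though the Java consumer had not been started yet; some devices were publishing to MQTT the whole time. I have just restarted EMQX and run the commands.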

2026-06-22T16:14:09.332167+08:00 [warning] msg: log_events_throttled_during_last_period, period: 1 minutes, 0 seconds, dropped: #{authentication_failure => 8}
2026-06-22T16:14:09.334717+08:00 [error] Generic server memsup terminating. Reason: {port_died,normal}. Last message: {‘EXIT’,<0.2441.0>,{port_died,normal}}. State: [{data,[{“Timeout”,60000}]},{items,{“Memory Usage”,[{“Allocated”,30302543872},{“Total”,32842895360}]}},{items,{“Worst Memory User”,[{“Pid”,<0.2029.0>},{“Memory”,4720520}]}}].
2026-06-22T16:14:09.335548+08:00 [error] crasher: initial call: memsup:init/1, pid: <0.2440.0>, registered_name: memsup, exit: {{port_died,normal},[{gen_server,handle_common_reply,8,[{file,“gen_server.erl”},{line,1226}]},{proc_lib,init_p_do_apply,3,[{file,“proc_lib.erl”},{line,241}]}]}, ancestors: [os_mon_sup,<0.2438.0>], message_queue_len: 1, messages: [time_to_collect], links: [<0.2439.0>], dictionary: [{system_memory_high_watermark,set}], trap_exit: true, status: running, heap_size: 4185, stack_size: 28, reductions: 91468; neighbours:
2026-06-22T16:14:09.336200+08:00 [error] Supervisor: {local,os_mon_sup}. Context: child_terminated. Reason: {port_died,normal}. Offender: id=memsup,pid=<0.2440.0>.
2026-06-22T16:14:09.340443+08:00 [critical] msg: received_terminate_signal
2026-06-22T16:14:09.423312+08:00 [warning] OS_MON (cpu_sup) called by <0.2617.0>, not started
2026-06-22T16:14:20.739507+08:00 [warning] tag: AUTHN, clientid: ipcs_2_data, msg: authentication_failure, peername: 112.25.55.8:61245, username: zhongzi, reason: not_authorized
These are the logs from just before I ran the commands.

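mqttConnectOptions.setCleanSession(true); // do not keep the session
The Java client sets this. I asked an AI, which said that mqttConnectOptions.setCleanSession(true) is equivalent to cleanStart=true + sessionExpiryInterval=0. The Java program uses MQTT 3.1.1; is the protocol version a problem? It worked fine earlier when the broker was 5.0, and this problem only appeared recently.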

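It is not an MQTT 3.1.1 version problem, and setCleanSession(true) is correct.
In MQTT 3.x, Clean Session = true means the session expires as soon as the connection closes, so this line of inquiry can largely be ruled out: subscriptions are not retained and offline messages do not pile up.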

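The new output shows two things:

  1. EMQX is not taking 28 GB at startup. emqx ctl vm memory reports only about 153 MB, ps shows an RSS of only about 256 MiB (262284 KiB), and free still shows 24Gi available. The 30 GB Allocated in memsup is the OS-level memory view, not the RSS of the BEAM process.

  2. The 13:17 event was an externally triggered stop: journalctl already shows systemd[1]: Stopping emqx daemon.... This is not the OOM killer killing the process directly.
    You first need to find out who is stopping/restarting EMQX: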

journalctl -u emqx --since "2026-06-22 16:10:00" --until "2026-06-22 16:16:00"
last -x | head -50
systemctl list-timers --all | grep -i emqx

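Separately, keep investigating the Java freeze as connection congestion. connection congested ... send_pend => 909900 means EMQX cannot write to the client's socket; it is not an offline-session message problem. Common causes are a blocked client callback thread, a slow network path, or a single connection subscribed to so many topics that it cannot keep up.
When you reproduce the freeze, capture both sides at the same time: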

# On the EMQX node
emqx ctl clients show gansu-mqtt-client1d064d479-cc7d-4957-901e-dedd7da786f3
ss -tinp | grep 1883
# On the Java client machine
jstack -l <java_pid> > /tmp/mqtt-client.jstack

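Post the clients show output and the Java thread dump. Focus on whether Paho's callback thread is stuck on the database, a lock, logging, an HTTP call, or a put/take on a business queue.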
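
If the thread dump does show the callback thread blocked, one common pattern is to hand each message to a bounded queue and do the slow work on a separate thread. Below is a minimal sketch using only java.util.concurrent. CallbackHandoff and its methods are hypothetical names, not part of Paho; it assumes onMessage would be called from Paho's messageArrived callback, and that dropping messages when the queue is full is acceptable because everything is QoS 0.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical helper, not part of Paho: keeps the MQTT callback thread
// free by handing payloads to a bounded queue that a worker thread drains.
final class CallbackHandoff {
    private final BlockingQueue<byte[]> queue;
    private final AtomicLong dropped = new AtomicLong();

    CallbackHandoff(int capacity) {
        this.queue = new ArrayBlockingQueue<>(capacity);
    }

    // Call from the MQTT callback. Never blocks: if the queue is full
    // the message is dropped and counted instead.
    boolean onMessage(byte[] payload) {
        boolean accepted = queue.offer(payload);
        if (!accepted) {
            dropped.incrementAndGet();
        }
        return accepted;
    }

    // Worker thread: blocks until a message is available.
    byte[] take() throws InterruptedException {
        return queue.take();
    }

    // Non-blocking variant; returns null when the queue is empty.
    byte[] poll() {
        return queue.poll();
    }

    long droppedCount() {
        return dropped.get();
    }

    int pending() {
        return queue.size();
    }
}
```

A worker thread would loop on take() and perform the database write there. A steadily growing droppedCount() would then indicate that the consumer cannot keep up, without stalling the socket that EMQX is writing to.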