EMQX cluster client connection count problem

Test environment:
EMQX version under test: 5.6.1

Server a:
    IP address: 192.168.111.1
    OS: Ubuntu 18.04
    Memory: 32 GB
    CPU: Intel i5-12400

Server b:
    IP address: 192.168.111.2
    OS: Ubuntu 18.04
    Memory: 32 GB
    CPU: Intel i5-12400

Server c:
    IP address: 192.168.111.2
    OS: Ubuntu 18.04
    Memory: 16 GB
    CPU: Intel i5-12400


EMQX 5.6.1 was installed standalone on servers a and b. Driving them with emqtt_bench, with each client publishing 1 KB of data per second,
each server can support at most about 45,000 clients.
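
For reference, the load pattern above (each client publishing roughly 1 KB once per second) is typically driven with commands along these lines. The flags follow emqtt_bench's documented options; the exact values here are illustrative, not necessarily the ones used in this test:

# 45,000 publishers, 1 KB payload every 1000 ms, against node a
emqtt_bench pub -h 192.168.111.1 -p 1883 -c 45000 -i 2 -I 1000 -s 1024 -t bench/%i -q 1
# one subscriber taking the whole stream
emqtt_bench sub -h 192.168.111.1 -p 1883 -c 1 -t 'bench/#' -q 1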

With servers a and b joined into a static cluster, and nginx on server c reverse-proxying to a and b,
why does the cluster still top out at 45,000 clients rather than 90,000?
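
For context, a minimal nginx TCP (stream) proxy for this layout might look like the sketch below. This is only an assumption about how server c could be configured, not the actual config used here, and nginx itself also needs its worker_connections and file-descriptor limits raised before it can carry ~90k concurrent TCP connections:

stream {
    upstream emqx_cluster {
        least_conn;
        server 192.168.111.1:1883;
        server 192.168.111.2:1883;
    }
    server {
        listen 1883;
        # plain TCP pass-through of MQTT to the two EMQX nodes
        proxy_pass emqx_cluster;
    }
}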

How should the cluster test environment be set up so that it supports more client connections while delivering messages stably? We would
appreciate a definitive answer from the EMQX team.

Then please provide the relevant logs. Without logs I cannot give you an accurate answer; all I can say is that
it looks like your nginx and EMQX servers have not been system-tuned.
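
For reference, the system tuning referred to here is the usual set of Linux limits for large numbers of concurrent connections, roughly along these lines (illustrative values; the authoritative list is in the EMQX tuning guide):

# allow enough file descriptors for ~100k sockets
sysctl -w fs.file-max=2097152
sysctl -w fs.nr_open=2097152
ulimit -n 1048576
# larger accept queue and a wider local port range for the proxy/benchmark side
sysctl -w net.core.somaxconn=32768
sysctl -w net.ipv4.ip_local_port_range='1024 65535'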


The servers have all been tuned following the system tuning guide (系统调优 | EMQX 5.6 文档)
and all of them use gigabit NICs.
Server a's send/receive traffic is around 260M, server b is about the same, and server c is around 620M.

Load is generated from two other benchmark machines running emqtt_bench pub, with a single subscriber.

Main issues
1. The subscriber side reports client(1): EXIT for {shutdown,tcp_closed} and the subscription fails.
Symptom: after about 20 minutes of testing, the client errors out and drops offline.

2. The servers raise congestion alarms

Alarm 1:
connection congested: #{memory => 37176,message_queue_len => 2,pid => <<"<0.5494659.0>">>,reductions => 15447,send_pend =>
1082,peername => <<"192.168.111.***:48204">>,sockname => <<"192.168.111.***:1883">>,buffer => 4096,high_msgq_watermark => 8192,high_watermark =>
1048576,recbuf => 374400,sndbuf => 87040,recv_cnt => 2,recv_oct => 100,send_cnt => 16,send_oct => 29528,username => <<"cloudnetlot">>,clientid =>
<<"ubuntu_bench_sub_4111510873_1">>,socktype => tcp,conn_state => connected,proto_name => <<"MQTT">>,proto_ver => 5,connected_at => 1736935626420}
Alarm 2:
connection congested: #{memory => 26736,message_queue_len => 0,pid => <<"<0.475604.0>">>,reductions => 17601,send_pend => 4708,peername =>
<<"192.168.111.222:29094">>,sockname => <<"192.168.111.221:1883">>,buffer => 4096,high_msgq_watermark => 8192,high_watermark =>
1048576,recbuf => 374400,sndbuf => 87040,recv_cnt => 2,recv_oct => 100,send_cnt => 21,send_oct => 32672,username => <<"cloudnetlot">>,
clientid => <<"ubuntu_bench_sub_3912537162_1">>,socktype => tcp,proto_name => <<"MQTT">>,
proto_ver => 5,connected_at => 1736935844606,conn_state => connected}

3.压力机客户端pub_overrun是否有影响
12m59s pub_overrun total=270196 rate=535.00/sec

Does this pub_overrun affect the test results?

Server logs:
2025-01-15T17:54:36.289483+08:00 [warning] msg: cluster_config_fetch_failures, peer_nodes: ['emqx@192.168.111.221'], self_node: 'emqx@192.168.111.220', booting_nodes: [{error,#{node => 'emqx@192.168.111.221',wall_clock => {1035,67},msg => "init_conf_load_not_done",release => "v5.6.1"}}], failed_nodes:
2025-01-15T17:54:36.346483+08:00 [warning] msg: cluster_routing_schema_discovery_failed, reason: Could not determine configured routing storage schema in peer nodes., responses: [{'emqx@192.168.111.221',unknown,starting}]
2025-01-15T17:56:34.150021+08:00 [warning] msg: alarm_is_activated, message: <<"connection congested: #{memory => 265712,message_queue_len => 4,pid => <<"<0.3775.0>">>,reductions => 39610836,send_pend => 6600,peername => <<"192.168.111.229:16544">>,sockname => <<"192.168.111.220:1883">>,buffer => 4096,high_msgq_watermark => 8192,high_watermark => 1048576,recbuf => 374400,sndbuf => 104448,recv_cnt => 2,recv_oct => 100,send_cnt => 76770,send_oct => 85444920,username => <<"c"...>>, name: <<"conn_congestion/ubuntu_bench_sub_4101390393_1/cloudnetlot">>
2025-01-15T17:57:34.152107+08:00 [warning] msg: alarm_is_deactivated, name: <<"conn_congestion/ubuntu_bench_sub_4101390393_1/cloudnetlot">>
2025-01-15T17:59:12.117038+08:00 [warning] msg: alarm_is_activated, message: <<"connection congested: #{memory => 71320,message_queue_len => 3,pid => <<"<0.3775.0>">>,reductions => 828709294,send_pend => 278,peername => <<"192.168.111.229:16544">>,sockname => <<"192.168.111.220:1883">>,buffer => 4096,high_msgq_watermark => 8192,high_watermark => 1048576,recbuf => 374400,sndbuf => 530944,recv_cnt => 2,recv_oct => 100,send_cnt => 1426559,send_oct => 2028045682,username => <"...>>, name: <<"conn_congestion/ubuntu_bench_sub_4101390393_1/cloudnetlot">>
2025-01-15T18:00:13.242285+08:00 [warning] msg: alarm_is_deactivated, name: <<"conn_congestion/ubuntu_bench_sub_4101390393_1/cloudnetlot">>
2025-01-15T18:03:18.252518+08:00 [warning] msg: alarm_is_activated, message: <<"connection congested: #{memory => 124232,message_queue_len => 1,pid => <<"<0.3775.0>">>,reductions => 4048964552,send_pend => 1854,peername => <<"192.168.111.229:16544">>,sockname => <<"192.168.111.220:1883">>,buffer => 4096,high_msgq_watermark => 8192,high_watermark => 1048576,recbuf => 374400,sndbuf => 1114112,recv_cnt => 3,recv_oct => 102,send_cnt => 6668693,send_oct => 10328508682,username"...>>, name: <<"conn_congestion/ubuntu_bench_sub_4101390393_1/cloudnetlot">>
2025-01-15T18:03:37.343987+08:00 [warning] msg: alarm_is_deactivated, name: <<"conn_congestion/ubuntu_bench_sub_4101390393_1/cloudnetlot">>
2025-01-15T18:03:37.344144+08:00 [error] supervisor: {esockd_connection_sup,<0.3775.0>}, errorContext: connection_shutdown, reason: #{max => 1000,reason => mailbox_overflow,value => 2485}, offender: [{pid,<0.3775.0>},{name,connection},{mfargs,{emqx_connection,start_link,[#{listener => {tcp,default},limiter => undefined,enable_authn => true,zone => default}]}}]
2025-01-15T18:07:06.422807+08:00 [warning] msg: alarm_is_activated, message: <<"connection congested: #{memory => 37176,message_queue_len => 2,pid => <<"<0.5494659.0>">>,reductions => 15447,send_pend => 1082,peername => <<"192.168.111.222:48204">>,sockname => <<"192.168.111.220:1883">>,buffer => 4096,high_msgq_watermark => 8192,high_watermark => 1048576,recbuf => 374400,sndbuf => 87040,recv_cnt => 2,recv_oct => 100,send_cnt => 16,send_oct => 29528,username => <<"cloudnetl"...>>, name: <<"conn_congestion/ubuntu_bench_sub_4111510873_1/cloudnetlot">>
2025-01-15T18:07:22.250722+08:00 [error] Process: <0.5494659.0> on node 'emqx@192.168.111.220', Context: maximum heap size reached, Max Heap Size: 6291456, Total Heap Size: 168654149, Kill: true, Error Logger: true, Message Queue Len: 0, GC Info: [{old_heap_block_size,66222786},{heap_block_size,58170533},{mbuf_size,44260858},{recent_size,973510},{stack_size,18},{old_heap_size,0},{heap_size,2984850},{bin_vheap_size,4864222},{bin_vheap_block_size,9235836},{bin_old_vheap_size,0},{bin_old_vheap_block_size,3527924}]



Analysis based on the logs:

  1. The subscriber is overloaded: the logs show send_pend = 1082, which means the subscriber's processing capacity has reached its limit.
  2. Dropped connections: the last screenshot shows the outbound rate at 0, so some or all of the subscribers have probably gone offline.

Suggested solutions:

  • Spread the subscriber load: EMQX is designed to handle large numbers of publishers (e.g. devices) periodically sending MQTT messages. To keep performance up, spread the subscriber-side load using the following approaches:
    • Shared subscriptions: let several subscribers jointly consume the same topic and share the load.
    • Topic separation: design distinct topic paths so that each subscriber handles only one category of messages, reducing the burden on any single subscriber (see the sketch after this list).
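
As a sketch of those two approaches (the topic names below are made up for illustration): a shared subscription uses the $share/<group>/<topic> filter, so the broker load-balances matching messages across the group members instead of sending every message to one subscriber, while topic separation gives each subscriber only a slice of the topic space:

$share/g1/bench/up     <- N clients subscribing with this filter split the bench/up traffic among themselves
bench/up/region1/#     <- a plain filter that covers only one partition of the publishers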

PS: a subscriber's message-processing capacity is no different in kind from a publisher's; each is still just one TCP connection. If handling one packet takes 1 ms, that connection can process only about 1,000 packets per second. That is why spreading the load is essential.
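
A quick back-of-the-envelope check with the numbers from this test: 45,000 publishers x 1 msg/s = 45,000 msg/s funneled into the single subscriber connection, while at roughly 1 ms per packet that connection can drain only about 1,000 msg/s; the difference accumulates as send_pend until the connection is killed.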

In theory a cluster should give 1 + 1 = 2, so why are we seeing 1 + 1 = 1? A single node and a two-node cluster produce the same result. Please advise.

With 1 subscriber and 50,000 publishers, the subscriber is shut down after only a few minutes.

Yes. 50,000 publishers feeding a single subscriber is a scenario EMQX is not really suited for. It comes back to the same point: a single subscriber, i.e. a single TCP connection, has a limited capacity.

As for why 1 + 1 does not come out as 2: if the topic design creates a hotspot, then that single hotspot has its own processing ceiling, and no matter how many resources you add to the cluster, that one hotspot cannot make use of them.

PS: if you only want to verify that 1 node handles 45k connections and therefore 2 nodes should handle 90k, remove the hotspot, i.e. do not send any traffic (no publish/subscribe) and test the connection count alone. But I suspect that is not the scenario you actually care about.
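
For that connection-only check, emqtt_bench has a conn mode; something along these lines, pointed at the nginx front end (the address is a placeholder):

emqtt_bench conn -h <nginx-ip> -p 1883 -c 90000 -i 2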

So 1 subscriber with 50,000 publishers is "not suitable for EMQX"? Fan-in scenarios are very common for EMQX, for example smart home: on the cloud side there are only a handful of subscribing nodes, while the clients easily number in the hundreds of thousands or millions. By that pattern, 45k publishers against one subscriber should be perfectly normal, so is there really no way to handle this kind of fan-in scenario?

Sorry for the misunderstanding. Taken in context, all I wanted to point out is that

a single subscriber (a single TCP connection) has a limited capacity.

For this kind of fan-in scenario, the usual solutions are

  1. Good topic design that splits the traffic
  2. Shared subscriptions
  3. In the Enterprise edition, the various actions that route data out to other systems (e.g. Kafka)

to get around the problem described above.

Using 6 shared subscriptions with this design, I got rid of some of the congestion alarms through configuration tuning, but I still cannot get a single server through a long-running emqtt_bench test with 50,000 clients: after all clients are online, about 10 minutes into the test the inbound and outbound rates fall back to around 46,000, and some clients also drop offline.
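
For reference, six shared subscribers like the ones described above can be started roughly like this (broker/proxy address and topic are placeholders, not the values actually used in this test):

emqtt_bench sub -h <broker-or-proxy-ip> -p 1883 -c 6 -t '$share/g1/bench/#' -q 1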

These are the settings I used to mitigate part of the problem; are there any other solutions?
Here are some of my configuration parameters:
listener {
    tcp {
        external {
            recbuf = 20MB
            sndbuf = 20MB
            buffer = 150MB
            backlog = 4096
            acceptors = 64
            max_connections = 1024000
            max_conn_rate = 50000
            bytes_rate = "500MB/s"
            rate_limit = "500MB/s"
        }
    }
}
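
Side note: the block above follows the older nested listener/tcp/external layout. If it is meant for EMQX 5.6, the equivalent knobs would normally live under listeners.tcp.default; a rough sketch with the buffer-related values copied from above (assuming the 5.x HOCON schema; the rate-limiting keys are omitted here because their exact names and formats vary across 5.x minor versions):

listeners.tcp.default {
    bind = "0.0.0.0:1883"
    max_connections = 1024000
    acceptors = 64
    tcp_options {
        # TCP socket buffers copied from the settings above
        backlog = 4096
        recbuf = "20MB"
        sndbuf = "20MB"
        buffer = "150MB"
    }
}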

For the high send_pend in the logs (outbound messages piling up faster than they can be sent), is there anything else that can be done?

The log is as follows:
connection congested: #{memory => 42696,message_queue_len => 1,pid => <<"<0.162004.0>">>,reductions => 85086237,send_pend => 1038,peername => <<"192.168.111.223:55430">>,sockname => <<"192.168.111.220:1883">>,buffer => 4096,high_msgq_watermark => 8192,high_watermark => 1048576,recbuf => 374400,sndbuf => 87040,recv_cnt => 2,recv_oct => 99,send_cnt => 159861,send_oct => 184042620,username => <<"cloudnetlot">>,clientid => <<"ubuntu_bench_sub_4046898618_1">>,socktype => tcp,proto_name => <<"MQTT">>,proto_ver => 5,connected_at => 1737099363895,conn_state => connected}

I don't think there is much else that can be done.