EMQX: one node's load is much higher than the others, uneven distribution

Hello, a quick question: we are running EMQX 5.7.2 with roughly 700+ connections, and the load on one node is much higher than on the other nodes.

Are the connections balanced across the three nodes? If not, you can use the rebalance feature: Node Evacuation and Cluster Load Rebalancing | EMQX Docs
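For reference, a rebalance is usually started and monitored from the CLI roughly like this (a sketch only; the exact subcommands and flags for your version are listed in the linked docs):

emqx ctl rebalance start --conn-evict-rate 10
emqx ctl rebalance status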

If they are about the same, then on the node with the high load, run emqx remote_console to enter the EMQX console, and then:

observer_cli:start().

to check whether there is any bottleneck in the system.

OK, please take a look. This is the node with the highest CPU usage.

Please take two more screenshots:

  1. In this interface, type mq and press Enter to sort by message queue length, then take a screenshot.
  2. Then type rr and press Enter, and take another screenshot.

Type q and press Enter to exit this interface. Press Ctrl + C twice in a row to exit the console.

mq:

rr:

The connections are balanced; we distribute them across the 3 nodes with SLB round-robin.

It looks like it is caused by gen_rpc (inter-node message transfer). We are not sure yet whether this is a real problem. Please try turning off the durable_session / persistent_session related features.
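For example, a minimal change in emqx.conf would be the following (note that this setting generally cannot be changed at runtime and requires a node restart to take effect):

durable_sessions.enable = false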

It's already running in production, so we can't just restart it casually :joy::joy::joy:

We can't determine the cause yet. It does look like it could be a problem; we would need to take a remote look to know for sure.

For a remote session, how can we assist from our side?

WeChat: 18253232330

OK, I've added you.

On the node with high CPU usage, we created a trace using redbug, tracing calls into the gen_rpc_acceptor module and their return values:

redbug:start("gen_rpc_acceptor -> return").

We found that gen_rpc_acceptor was receiving messages at a high rate:

% 03:28:52 <0.26366990.5>(dead)
% gen_rpc_acceptor:call_middleman(emqx_ds_replication_layer, do_next_v1, [messages,<<"8">>,
 #{1 => 2,2 => 1,
   3 =>
       #{1 => 2,
         2 => [<<"localserver">>,<<"shopId">>,<<"508175243">>],
         3 => 1729213770182000,
         4 => {4080822714,[<<"508175243">>]},
         5 => <<>>}},
 100])

% 03:28:52 <0.26370162.5>(dead)
% gen_rpc_acceptor:call_middleman/3 -> {exit,
                                        {call_middleman_result,
                                         {ok,
                                          #{1 => 2,2 => 1,
                                            3 =>
                                             #{1 => 2,
                                               2 =>
                                                [<<"localserver">>,
                                                 <<"shopId">>,<<"504232254">>],
                                               3 => 1729208970791000,
                                               4 =>
                                                {4080822714,[<<"504232254">>]},
                                               5 => <<>>}},
                                          []}}}

% 03:28:52 <0.26366990.5>(dead)
% gen_rpc_acceptor:call_middleman/3 -> {exit,
                                        {call_middleman_result,
                                         {ok,
                                          #{1 => 2,2 => 1,
                                            3 =>
                                             #{1 => 2,
                                               2 =>
                                                [<<"localserver">>,
                                                 <<"shopId">>,<<"508175243">>],
                                               3 => 1729213770182000,
                                               4 =>
                                                {4080822714,[<<"508175243">>]},
                                               5 => <<>>}},
                                          []}}}

% 03:28:52 <0.26370269.5>(dead)
% gen_rpc_acceptor:call_middleman(emqx_ds_replication_layer, do_next_v1, [messages,<<"6">>,
 #{1 => 2,2 => 1,
   3 =>
       #{1 => 2,
         2 => [<<"localserver">>,<<"shopId">>,<<"508175243">>],
         3 => 1729213770182000,
         4 => {4080822714,[<<"508175243">>]},
         5 =>
             <<243,60,105,186,0,6,36,183,230,83,101,117,95,153,139,251,183,
               161,214,80>>}},
 100])

% 03:28:52 <0.26370269.5>(dead)
% gen_rpc_acceptor:call_middleman/3 -> {exit,
                                        {call_middleman_result,
                                         {ok,
                                          #{1 => 2,2 => 1,
                                            3 =>
                                             #{1 => 2,
                                               2 =>
                                                [<<"localserver">>,
                                                 <<"shopId">>,<<"508175243">>],
                                               3 => 1729213770182000,
                                               4 =>
                                                {4080822714,[<<"508175243">>]},
                                               5 =>
                                                <<243,60,105,186,0,6,36,183,
                                                  230,83,101,117,95,153,139,
                                                  251,183,161,214,80>>}},
                                          []}}}

% 03:28:52 <0.26370253.5>(dead)
% gen_rpc_acceptor:call_middleman(emqx_ds_replication_layer, do_next_v1, [messages,<<"10">>,
 #{1 => 2,2 => 1,
   3 =>
       #{1 => 2,
         2 => [<<"localserver">>,<<"shopId">>,<<"508175243">>],
         3 => 1729213770182000,
         4 => {4080822714,[<<"508175243">>]},
         5 =>
             <<243,60,105,186,0,6,36,183,230,83,101,117,95,153,139,251,183,
               169,96,0>>}},
 100])

% 03:28:52 <0.26370253.5>(dead)
% gen_rpc_acceptor:call_middleman/3 -> {exit,
                                        {call_middleman_result,
                                         {ok,
                                          #{1 => 2,2 => 1,
                                            3 =>
                                             #{1 => 2,
                                               2 =>
                                                [<<"localserver">>,
                                                 <<"shopId">>,<<"508175243">>],
                                               3 => 1729213770182000,
                                               4 =>
                                                {4080822714,[<<"508175243">>]},
                                               5 =>
                                                <<243,60,105,186,0,6,36,183,
                                                  230,83,101,117,95,153,139,
                                                  251,183,169,96,0>>}},
                                          []}}}
redbug done, msg_count - 5
v5.7.2(emqx@10.128.17.32)2> 

There is also a warning log message on one of the nodes with normal CPU load:

{"time":1729219792582434,"level":"warning","msg":"emqx_persistent_session_ds_replay_inconsistency","clientid":"localserver/clientId/505050102","username":"test","got":"{srs,<<\"10\">>,1,#{1 => 2,2 => <<\"10\">>,3 => #{1 => 2,2 => 1,3 => #{1 => 2,2 => [<<\"localserver\">>,<<\"shopId\">>,<<\"505050102\">>],3 => 1729213264491000,4 => {4080822714,[<<\"505050102\">>]},5 => <<243,60,105,186,0,6,36,183,87,236,171,203,191,61,63,26,229,248,171,40>>}}},#{1 => 2,2 => <<\"10\">>,3 => #{1 => 2,2 => 1,3 => #{1 => 2,2 => [<<\"localserver\">>,<<\"shopId\">>,<<\"505050102\">>],3 => 1729213264491000,4 => {4080822714,[<<\"505050102\">>]},5 => <<243,60,105,186,0,6,36,183,89,198,112,132,242,211,158,118,188,117,20,24>>}}},1,11,0,12,0,false,1}","expected":"{srs,<<\"10\">>,1,#{1 => 2,2 => <<\"10\">>,3 => #{1 => 2,2 => 1,3 => #{1 => 2,2 => [<<\"localserver\">>,<<\"shopId\">>,<<\"505050102\">>],3 => 1729213264491000,4 => {4080822714,[<<\"505050102\">>]},5 => <<243,60,105,186,0,6,36,183,87,236,171,203,191,61,63,26,229,248,171,40>>}}},#{1 => 2,2 => <<\"10\">>,3 => #{1 => 2,2 => 1,3 => #{1 => 2,2 => [<<\"localserver\">>,<<\"shopId\">>,<<\"505050102\">>],3 => 1729213264491000,4 => {4080822714,[<<\"505050102\">>]},5 => <<243,60,105,186,0,6,36,183,89,210,48,221,252,214,168,98,223,164,156,168>>}}},1,11,0,12,0,false,1}","mfa":"{emqx_persistent_session_ds,replay_batch,3}","peername":"39.188.10.240:22121","pid":"<0.1926884.0>","line":651}

The user is using the durable session feature:

durable_sessions.enable = true

The total message rate is very low:

And they have less than 1K subscribers:

The topic example: