Replicant nodes cannot join the cluster

Environment

  • EMQX version: 5.0.25
  • OS version: 5.10.178-162.673.amzn2.aarch64

Steps to reproduce the issue

  1. Set up 3 core nodes and 10 replicant nodes
  2. Scale down to 0 replicant nodes
  3. Scale back up to 10 replicant nodes
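
For context: the roles in step 1 are set per node via node.db_role in emqx.conf (the full config is posted later in this thread). A quick runtime check of the effective role is sketched below, assuming EMQX maps node.db_role onto mria's node_role application environment (the env key is an assumption; the command runs either way and returns undefined if the key differs):

./bin/emqx eval 'application:get_env(mria, node_role)'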

Expected behavior

The replicant nodes join the cluster normally.

Actual behavior

2023-05-31T05:59:16.293826+00:00 [warning] msg: Dashboard monitor error, mfa: emqx_dashboard_monitor:current_rate/1, line: 144, reason: {noproc,{gen_server,call,[emqx_dashboard_monitor,current_rate,5000]}}
2023-05-31T05:59:19.284816+00:00 [warning] msg: Dashboard monitor error, mfa: emqx_dashboard_monitor:current_rate/1, line: 144, reason: {noproc,{gen_server,call,[emqx_dashboard_monitor,current_rate,5000]}}
2023-05-31T05:59:22.391515+00:00 [warning] msg: Dashboard monitor error, mfa: emqx_dashboard_monitor:current_rate/1, line: 144, reason: {noproc,{gen_server,call,[emqx_dashboard_monitor,current_rate,5000]}}
2023-05-31T05:59:25.284789+00:00 [warning] msg: Dashboard monitor error, mfa: emqx_dashboard_monitor:current_rate/1, line: 144, reason: {noproc,{gen_server,call,[emqx_dashboard_monitor,current_rate,5000]}}
2023-05-31T05:59:28.299784+00:00 [warning] msg: Dashboard monitor error, mfa: emqx_dashboard_monitor:current_rate/1, line: 144, reason: {noproc,{gen_server,call,[emqx_dashboard_monitor,current_rate,5000]}}
2023-05-31T05:59:31.304818+00:00 [warning] msg: Dashboard monitor error, mfa: emqx_dashboard_monitor:current_rate/1, line: 144, reason: {noproc,{gen_server,call,[emqx_dashboard_monitor,current_rate,5000]}}
2023-05-31T05:59:34.280598+00:00 [warning] msg: Dashboard monitor error, mfa: emqx_dashboard_monitor:current_rate/1, line: 144, reason: {noproc,{gen_server,call,[emqx_dashboard_monitor,current_rate,5000]}}
2023-05-31T05:59:37.280142+00:00 [warning] msg: Dashboard monitor error, mfa: emqx_dashboard_monitor:current_rate/1, line: 144, reason: {noproc,{gen_server,call,[emqx_dashboard_monitor,current_rate,5000]}}
2023-05-31T05:59:40.302780+00:00 [warning] msg: Dashboard monitor error, mfa: emqx_dashboard_monitor:current_rate/1, line: 144, reason: {noproc,{gen_server,call,[emqx_dashboard_monitor,current_rate,5000]}}
2023-05-31T05:59:43.294644+00:00 [warning] msg: Dashboard monitor error, mfa: emqx_dashboard_monitor:current_rate/1, line: 144, reason: {noproc,{gen_server,call,[emqx_dashboard_monitor,current_rate,5000]}}
2023-05-31T05:59:46.291294+00:00 [warning] msg: Dashboard monitor error, mfa: emqx_dashboard_monitor:current_rate/1, line: 144, reason: {noproc,{gen_server,call,[emqx_dashboard_monitor,current_rate,5000]}}
2023-05-31T05:59:49.292347+00:00 [warning] msg: Dashboard monitor error, mfa: emqx_dashboard_monitor:current_rate/1, line: 144, reason: {noproc,{gen_server,call,[emqx_dashboard_monitor,current_rate,5000]}}
2023-05-31T05:59:49.295833+00:00 [error] State machine '$mria_meta_shard' terminating. Reason: {timeout,{gen_server,call,[mria_lb,core_nodes,30000]}}. Stack: [{gen_server,call,3,[{file,"gen_server.erl"},{line,247}]},{mria_rlog_replica,try_connect,2,[{file,"mria_rlog_replica.erl"},{line,378}]},{mria_rlog_replica,handle_reconnect,1,[{file,"mria_rlog_replica.erl"},{line,341}]},{gen_statem,loop_state_callback,11,[{file,"gen_statem.erl"},{line,1205}]},{proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,226}]}]. Last event: {state_timeout,reconnect}. State: {disconnected,{d,'$mria_meta_shard',<0.1976.0>,undefined,undefined,undefined,0,undefined,undefined,false}}.
2023-05-31T05:59:49.296110+00:00 [error] crasher: initial call: mria_rlog_replica:init/1, pid: <0.1977.0>, registered_name: '$mria_meta_shard', exit: {{timeout,{gen_server,call,[mria_lb,core_nodes,30000]}},[{gen_server,call,3,[{file,"gen_server.erl"},{line,247}]},{mria_rlog_replica,try_connect,2,[{file,"mria_rlog_replica.erl"},{line,378}]},{mria_rlog_replica,handle_reconnect,1,[{file,"mria_rlog_replica.erl"},{line,341}]},{gen_statem,loop_state_callback,11,[{file,"gen_statem.erl"},{line,1205}]},{proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,226}]}]}, ancestors: [<0.1976.0>,mria_shards_sup,mria_rlog_sup,mria_sup,<0.1911.0>], message_queue_len: 0, messages: [], links: [<0.1976.0>], dictionary: [{'$logger_metadata$',#{domain => [mria,rlog,replica],shard => '$mria_meta_shard'}}], trap_exit: true, status: running, heap_size: 10958, stack_size: 28, reductions: 9080; neighbours:
2023-05-31T05:59:49.296525+00:00 [error] Supervisor: {<0.1976.0>,mria_replicant_shard_sup}. Context: child_terminated. Reason: {timeout,{gen_server,call,[mria_lb,core_nodes,30000]}}. Offender: id=replica,pid=<0.1977.0>.
2023-05-31T05:59:49.296693+00:00 [error] Supervisor: {<0.1976.0>,mria_replicant_shard_sup}. Context: shutdown. Reason: reached_max_restart_intensity. Offender: id=replica,pid=<0.1977.0>.
2023-05-31T05:59:49.296821+00:00 [error] Supervisor: {local,mria_shards_sup}. Context: child_terminated. Reason: shutdown. Offender: id='$mria_meta_shard',pid=<0.1976.0>.
2023-05-31T05:59:52.311724+00:00 [warning] msg: Dashboard monitor error, mfa: emqx_dashboard_monitor:current_rate/1, line: 144, reason: {noproc,{gen_server,call,[emqx_dashboard_monitor,current_rate,5000]}}
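
The call that times out here is gen_server:call(mria_lb, core_nodes, 30000), taken verbatim from the crash reports above. As a minimal check (a sketch, using the same eval mechanism that appears later in this thread), the call can be issued by hand on a stuck replicant to see whether mria_lb ever answers:

./bin/emqx eval 'gen_server:call(mria_lb, core_nodes, 30000)'

If this also hangs for the full 30 seconds, it suggests the local mria_lb process itself is stuck, which matches the repeated shard restarts in the log.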

You could try the latest v5.0.26; a similar, related error was fixed there.

@zhongwencool 5.0.26 does not solve the problem:

2023-06-07T05:43:12.521503+00:00 [error] State machine '$mria_meta_shard' terminating. Reason: {timeout,{gen_server,call,[mria_lb,core_nodes,30000]}}. Stack: [{gen_server,call,3,[{file,"gen_server.erl"},{line,247}]},{mria_rlog_replica,try_connect,2,[{file,"mria_rlog_replica.erl"},{line,378}]},{mria_rlog_replica,handle_reconnect,1,[{file,"mria_rlog_replica.erl"},{line,341}]},{gen_statem,loop_state_callback,11,[{file,"gen_statem.erl"},{line,1205}]},{proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,226}]}]. Last event: {state_timeout,reconnect}. State: {disconnected,{d,'$mria_meta_shard',<0.1988.0>,undefined,undefined,undefined,0,undefined,undefined,false}}.
2023-06-07T05:43:12.522146+00:00 [error] crasher: initial call: mria_rlog_replica:init/1, pid: <0.1989.0>, registered_name: '$mria_meta_shard', exit: {{timeout,{gen_server,call,[mria_lb,core_nodes,30000]}},[{gen_server,call,3,[{file,"gen_server.erl"},{line,247}]},{mria_rlog_replica,try_connect,2,[{file,"mria_rlog_replica.erl"},{line,378}]},{mria_rlog_replica,handle_reconnect,1,[{file,"mria_rlog_replica.erl"},{line,341}]},{gen_statem,loop_state_callback,11,[{file,"gen_statem.erl"},{line,1205}]},{proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,226}]}]}, ancestors: [<0.1988.0>,mria_shards_sup,mria_rlog_sup,mria_sup,<0.1909.0>], message_queue_len: 0, messages: [], links: [<0.1988.0>], dictionary: [{'$logger_metadata$',#{domain => [mria,rlog,replica],shard => '$mria_meta_shard'}}], trap_exit: true, status: running, heap_size: 10958, stack_size: 28, reductions: 21032; neighbours: []
2023-06-07T05:43:12.522915+00:00 [error] Supervisor: {<0.1988.0>,mria_replicant_shard_sup}. Context: child_terminated. Reason: {timeout,{gen_server,call,[mria_lb,core_nodes,30000]}}. Offender: id=replica,pid=<0.1989.0>.
2023-06-07T05:43:12.523193+00:00 [error] Supervisor: {<0.1988.0>,mria_replicant_shard_sup}. Context: shutdown. Reason: reached_max_restart_intensity. Offender: id=replica,pid=<0.1989.0>.
2023-06-07T05:43:12.523408+00:00 [error] Supervisor: {local,mria_shards_sup}. Context: child_terminated. Reason: shutdown. Offender: id='$mria_meta_shard',pid=<0.1988.0>.
2023-06-07T05:43:42.524484+00:00 [error] State machine '$mria_meta_shard' terminating. Reason: {timeout,{gen_server,call,[mria_lb,core_nodes,30000]}}. Stack: [{gen_server,call,3,[{file,"gen_server.erl"},{line,247}]},{mria_rlog_replica,try_connect,2,[{file,"mria_rlog_replica.erl"},{line,378}]},{mria_rlog_replica,handle_reconnect,1,[{file,"mria_rlog_replica.erl"},{line,341}]},{gen_statem,loop_state_callback,11,[{file,"gen_statem.erl"},{line,1205}]},{proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,226}]}]. Last event: {state_timeout,reconnect}. State: {disconnected,{d,'$mria_meta_shard',<0.2131.0>,undefined,undefined,undefined,0,undefined,undefined,false}}.
2023-06-07T05:43:42.525094+00:00 [error] crasher: initial call: mria_rlog_replica:init/1, pid: <0.2132.0>, registered_name: '$mria_meta_shard', exit: {{timeout,{gen_server,call,[mria_lb,core_nodes,30000]}},[{gen_server,call,3,[{file,"gen_server.erl"},{line,247}]},{mria_rlog_replica,try_connect,2,[{file,"mria_rlog_replica.erl"},{line,378}]},{mria_rlog_replica,handle_reconnect,1,[{file,"mria_rlog_replica.erl"},{line,341}]},{gen_statem,loop_state_callback,11,[{file,"gen_statem.erl"},{line,1205}]},{proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,226}]}]}, ancestors: [<0.2131.0>,mria_shards_sup,mria_rlog_sup,mria_sup,<0.1909.0>], message_queue_len: 0, messages: [], links: [<0.2131.0>], dictionary: [{'$logger_metadata$',#{domain => [mria,rlog,replica],shard => '$mria_meta_shard'}}], trap_exit: true, status: running, heap_size: 6772, stack_size: 28, reductions: 20602; neighbours: []
2023-06-07T05:43:42.526016+00:00 [error] Supervisor: {<0.2131.0>,mria_replicant_shard_sup}. Context: child_terminated. Reason: {timeout,{gen_server,call,[mria_lb,core_nodes,30000]}}. Offender: id=replica,pid=<0.2132.0>.
2023-06-07T05:43:42.526254+00:00 [error] Supervisor: {<0.2131.0>,mria_replicant_shard_sup}. Context: shutdown. Reason: reached_max_restart_intensity. Offender: id=replica,pid=<0.2132.0>.
2023-06-07T05:43:42.526487+00:00 [error] Supervisor: {local,mria_shards_sup}. Context: child_terminated. Reason: shutdown. Offender: id='$mria_meta_shard',pid=<0.2131.0>.
2023-06-07T05:44:12.527544+00:00 [error] State machine '$mria_meta_shard' terminating. Reason: {timeout,{gen_server,call,[mria_lb,core_nodes,30000]}}. Stack: [{gen_server,call,3,[{file,"gen_server.erl"},{line,247}]},{mria_rlog_replica,try_connect,2,[{file,"mria_rlog_replica.erl"},{line,378}]},{mria_rlog_replica,handle_reconnect,1,[{file,"mria_rlog_replica.erl"},{line,341}]},{gen_statem,loop_state_callback,11,[{file,"gen_statem.erl"},{line,1205}]},{proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,226}]}]. Last event: {state_timeout,reconnect}. State: {disconnected,{d,'$mria_meta_shard',<0.2225.0>,undefined,undefined,undefined,0,undefined,undefined,false}}.
2023-06-07T05:44:12.528141+00:00 [error] crasher: initial call: mria_rlog_replica:init/1, pid: <0.2226.0>, registered_name: '$mria_meta_shard', exit: {{timeout,{gen_server,call,[mria_lb,core_nodes,30000]}},[{gen_server,call,3,[{file,"gen_server.erl"},{line,247}]},{mria_rlog_replica,try_connect,2,[{file,"mria_rlog_replica.erl"},{line,378}]},{mria_rlog_replica,handle_reconnect,1,[{file,"mria_rlog_replica.erl"},{line,341}]},{gen_statem,loop_state_callback,11,[{file,"gen_statem.erl"},{line,1205}]},{proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,226}]}]}, ancestors: [<0.2225.0>,mria_shards_sup,mria_rlog_sup,mria_sup,<0.1909.0>], message_queue_len: 0, messages: [], links: [<0.2225.0>], dictionary: [{'$logger_metadata$',#{domain => [mria,rlog,replica],shard => '$mria_meta_shard'}}], trap_exit: true, status: running, heap_size: 6772, stack_size: 28, reductions: 20601; neighbours: []
2023-06-07T05:44:12.528905+00:00 [error] Supervisor: {<0.2225.0>,mria_replicant_shard_sup}. Context: child_terminated. Reason: {timeout,{gen_server,call,[mria_lb,core_nodes,30000]}}. Offender: id=replica,pid=<0.2226.0>.
2023-06-07T05:44:12.529108+00:00 [error] Supervisor: {<0.2225.0>,mria_replicant_shard_sup}. Context: shutdown. Reason: reached_max_restart_intensity. Offender: id=replica,pid=<0.2226.0>.
2023-06-07T05:44:12.529289+00:00 [error] Supervisor: {local,mria_shards_sup}. Context: child_terminated. Reason: shutdown. Offender: id='$mria_meta_shard',pid=<0.2225.0>.

@Shawn Any progress on this?

This is a known issue; the fix will be included in the 5.1.0 release.
Running ./bin/emqx exec 'application:set_env(mria, rlog_lb_update_interval, 15000)' on every node should work around it.
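
Note that application:set_env/3 only changes the running VM, so a restart loses the override. A sketch of persisting it, assuming this EMQX release reads an optional etc/vm.args at boot (the standard Erlang "-App Key Value" flag form; path and mechanism are assumptions):

echo '-mria rlog_lb_update_interval 15000' >> etc/vm.args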

@Shawn Does changing this value have any impact on cluster performance, stability, or anything else?

@Shawn There is no such command in the CLI?

[root@ip-10-51-4-185 emqx]# ./bin/emqx eval 'application:set_env(mria, rlog_lb_update_interval, 15000)'
ok
[root@ip-10-51-4-185 emqx]# ./bin/emqx eval 'application:get_env(mria, rlog_lb_update_interval)'
{ok,15000}
The node still cannot join the cluster after running this command.

After scaling back up to 10 replicant nodes, did you run it on every node that reports the error? I tried this locally before and it should resolve the issue. If it still does not work, you can email heeejianbo@gmail.com and we can set up an online meeting to take a look.
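
For completeness, a sketch of applying the workaround to every replicant in one pass (the hostnames and install path below are hypothetical; assumes SSH access to each node):

for h in repl-01 repl-02 repl-03; do
  ssh "$h" "/opt/emqx/bin/emqx eval 'application:set_env(mria, rlog_lb_update_interval, 15000)'"
done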

@heeejianbo We will verify again after 5.1.0 is released, thanks!

@JimMoen Has this issue been fixed in v5.1.0?

Thanks everyone for following this issue; we have fixed it in 5.1.0.

@heeejianbo According to my testing, 5.1.0 has not fixed this issue.

Steps: 1. Start 3 core + 10 replicant nodes. 2. Shut down all replicant nodes. 3. Start 3 replicant nodes; they do not join the cluster.

Logs below:

2023-07-03T06:16:27.853756+00:00 [warning] msg: Dashboard monitor error, mfa: emqx_dashboard_monitor:current_rate/1, line: 144, reason: {noproc,{gen_server,call,[emqx_dashboard_monitor,current_rate,5000]}}
2023-07-03T06:16:30.847682+00:00 [warning] msg: Dashboard monitor error, mfa: emqx_dashboard_monitor:current_rate/1, line: 144, reason: {noproc,{gen_server,call,[emqx_dashboard_monitor,current_rate,5000]}}
2023-07-03T06:16:33.853121+00:00 [warning] msg: Dashboard monitor error, mfa: emqx_dashboard_monitor:current_rate/1, line: 144, reason: {noproc,{gen_server,call,[emqx_dashboard_monitor,current_rate,5000]}}
2023-07-03T06:16:36.846967+00:00 [warning] msg: Dashboard monitor error, mfa: emqx_dashboard_monitor:current_rate/1, line: 144, reason: {noproc,{gen_server,call,[emqx_dashboard_monitor,current_rate,5000]}}
2023-07-03T06:16:38.884960+00:00 [warning] msg: Dashboard monitor error, mfa: emqx_dashboard_monitor:current_rate/1, line: 144, reason: {noproc,{gen_server,call,[emqx_dashboard_monitor,current_rate,5000]}}
2023-07-03T06:16:41.851448+00:00 [warning] msg: Dashboard monitor error, mfa: emqx_dashboard_monitor:current_rate/1, line: 144, reason: {noproc,{gen_server,call,[emqx_dashboard_monitor,current_rate,5000]}}
2023-07-03T06:16:44.861896+00:00 [warning] msg: Dashboard monitor error, mfa: emqx_dashboard_monitor:current_rate/1, line: 144, reason: {noproc,{gen_server,call,[emqx_dashboard_monitor,current_rate,5000]}}
2023-07-03T06:16:47.857659+00:00 [warning] msg: Dashboard monitor error, mfa: emqx_dashboard_monitor:current_rate/1, line: 144, reason: {noproc,{gen_server,call,[emqx_dashboard_monitor,current_rate,5000]}}
2023-07-03T06:16:50.853660+00:00 [warning] msg: Dashboard monitor error, mfa: emqx_dashboard_monitor:current_rate/1, line: 144, reason: {noproc,{gen_server,call,[emqx_dashboard_monitor,current_rate,5000]}}
2023-07-03T06:16:53.847682+00:00 [warning] msg: Dashboard monitor error, mfa: emqx_dashboard_monitor:current_rate/1, line: 144, reason: {noproc,{gen_server,call,[emqx_dashboard_monitor,current_rate,5000]}}
Failed to load config
exit
{timeout,{gen_server,call,[mria_lb,core_nodes,30000]}}
[{gen_server,call,3,[{file,"gen_server.erl"},{line,385}]},
 {emqx_conf_app,cluster_nodes,0,[{file,"emqx_conf_app.erl"},{line,106}]},
 {emqx_conf_app,sync_cluster_conf,0,[{file,"emqx_conf_app.erl"},{line,110}]},
 {emqx_conf_app,init_conf,0,[{file,"emqx_conf_app.erl"},{line,100}]},
 {emqx_conf_app,start,2,[{file,"emqx_conf_app.erl"},{line,32}]},
 {application_master,start_it_old,4,
                     [{file,"application_master.erl"},{line,293}]}]
2023-07-03T06:16:55.527707+00:00 [error] crasher: initial call: application_master:init/4, pid: <0.2188.0>, registered_name: [], exit: {{bad_return,{{emqx_conf_app,start,[normal,[]]},{'EXIT',{{case_clause,undefined},[{emqx_config_logger,tr_console_handler,1,[{file,"emqx_config_logger.erl"},{line,129}]},{emqx_config_logger,tr_handlers,1,[{file,"emqx_config_logger.erl"},{line,124}]},{emqx_config_logger,do_refresh_config,1,[{file,"emqx_config_logger.erl"},{line,48}]},{emqx_conf_app,start,2,[{file,"emqx_conf_app.erl"},{line,39}]},{application_master,start_it_old,4,[{file,"application_master.erl"},{line,293}]}]}}}},[{application_master,init,4,[{file,"application_master.erl"},{line,142}]},{proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,240}]}]}, ancestors: [<0.2187.0>], message_queue_len: 1, messages: [{'EXIT',<0.2189.0>,normal}], links: [<0.2187.0>,<0.1773.0>], dictionary: [], trap_exit: true, status: running, heap_size: 610, stack_size: 28, reductions: 215; neighbours:
2023-07-03T06:16:56.851168+00:00 [warning] msg: Dashboard monitor error, mfa: emqx_dashboard_monitor:current_rate/1, line: 144, reason: {noproc,{gen_server,call,[emqx_dashboard_monitor,current_rate,5000]}}
2023-07-03T06:16:59.891805+00:00 [warning] msg: Dashboard monitor error, mfa: emqx_dashboard_monitor:current_rate/1, line: 144, reason: {noproc,{gen_server,call,[emqx_dashboard_monitor,current_rate,5000]}}
2023-07-03T06:17:00.559102+00:00 [error] Supervisor: {local,mria_rlog_sup}. Context: shutdown_error. Reason: killed. Offender: id=mria_lb,pid=<0.2054.0>.
Logger - error: {removed_failing_handler,console}

This looks like a different issue. If you try joining the cluster at this point, does it succeed?
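
A manual join can be attempted from the stuck replicant with the standard CLI (the core node name below is hypothetical; substitute one of your core nodes):

./bin/emqx ctl cluster join emqx-core1@10.51.0.1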

Is any auto-clustering mechanism configured for the replicas? Could you paste your config and reproduction steps? I will try it locally.

We use etcd service discovery; the join does not succeed.
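
One thing worth checking is whether the stuck replicants actually register in etcd. A sketch using etcdctl with the endpoint and certificates from the config below (v3 API; the exact key layout EMQX writes is an assumption, hence listing the whole prefix):

ETCDCTL_API=3 etcdctl \
  --endpoints=https://etcd.XXXX.com:2379 \
  --cacert=/data/etcdssl/ca.pem \
  --cert=/data/etcdssl/etcd.pem \
  --key=/data/etcdssl/etcd-key.pem \
  get --prefix "emqx-cluster" --keys-only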

The config file is as follows:

## NOTE:
## Configs in this file might be overridden by:
## 1. Environment variables which start with 'EMQX_' prefix
## 2. File $EMQX_NODE__DATA_DIR/configs/cluster-override.conf
## 3. File $EMQX_NODE__DATA_DIR/configs/local-override.conf
##
## The *-override.conf files are overwritten at runtime when changes
## are made from EMQX dashboard UI, management HTTP API, or CLI.
## All configuration details can be found in emqx.conf.example

# Node configuration
node = {
    name = "emqx-2ec1@10.51.3.68"
    cookie = "iGxSH4NZasa7tOjHmoCmhs2rSeRAFGOF"
    data_dir = "data"
    db_role = replicant
    db_backend = rlog
    process_limit = 2048000
    max_ports = 1024000
    dist_buffer_size = 8192
    max_ets_tables = 524288
    crash_dump_file = "log/crash.dump"
    dist_net_ticktime = 5m
    cluster_call = {
      retry_interval = 1m
      max_history = 100
      cleanup_interval = 5m
    }
}

# RPC configuration
rpc = {
    mode = async
    async_batch_size = 256
    tcp_server_port = 5369
    tcp_client_num = 96
    port_discovery = manual
    connect_timeout = 3s
    send_timeout = 3s
    authentication_timeout = 3s
    call_receive_timeout = 7s
    socket_keepalive_idle = 15m
    socket_keepalive_interval = 75s
    socket_keepalive_count = 9
    socket_sndbuf = 1MB
    socket_recbuf = 1MB
    socket_buffer = 1MB
}

# Global MQTT message settings
mqtt = {
    max_clientid_len = 1024
    max_topic_levels = 7
    max_qos_allowed = 2
    max_topic_alias = 0
    retain_available = true
    wildcard_subscription = false
    shared_subscription = false
    ignore_loop_deliver = false
}

# Cluster configuration
cluster = {
    name = "emqx_v5_cluster"
    autoheal = true
    autoclean = 5m
    proto_dist = inet_tcp
    discovery_strategy = "etcd"
    etcd = {
      server = "https://etcd.XXXX.com:2379"
      prefix = "emqx-cluster"
      node_ttl = 1m
      ssl = {
        keyfile = "/data/etcdssl/etcd-key.pem"
        cacertfile = "/data/etcdssl/ca.pem"
        certfile = "/data/etcdssl/etcd.pem"
        enable = true
      }
    }
}

# Listener configuration
listeners.tcp.default = {
    bind = "0.0.0.0:1883"
    max_connections = 1024000
    proxy_protocol = true
    proxy_protocol_timeout = 3s
    enable_authn = true
    acceptors = 64
    limiter = {
      connection = {
        rate = "237/s"
        initial = 237
        capacity = 1185
        burst = "20"
      }
      client.message_in = {
        rate = "100/s"
        initial = 100
        capacity = 500
      }
    }
    access_rules = ["allow all"]
    tcp_options = {
      active_n = 100
      backlog = 1024
      send_timeout = 7s
      send_timeout_close = true
      nodelay = true
      reuseaddr = true
    }
    zone = external
}

zone.external.mqtt = {
    idle_timeout = 15s
    max_packet_size = 128KB
    exclusive_subscription = false
    use_username_as_clientid = false
    wildcard_subscription = false
    shared_subscription = false
    max_subscriptions = 20
    upgrade_qos = false
    keepalive_backoff = 0.75
    max_inflight = 32
    retry_interval = 10s
    max_awaiting_rel = 100
    await_rel_timeout = 50s
    session_expiry_interval = 5m
    max_mqueue_len = 100
    mqueue_priorities = disabled
    mqueue_default_priority = highest
    mqueue_store_qos0 = true
    ignore_loop_deliver = false
}

listeners.tcp.internal = {
  bind = "0.0.0.0:38811"
  acceptors = 64
  max_connections = 102400
  proxy_protocol = false
  enable_authn = true
  limiter.connection = {
    rate = "237/s"
    burst = "20"
  }
  tcp_options = {
    active_n = 300
    backlog = 1024
    send_timeout = 3s
    send_timeout_close = true
    nodelay = true
    reuseaddr = true
  }
  zone = internal
}

zone.internal.mqtt = {
    wildcard_subscription = true
    shared_subscription = true
    max_subscriptions = infinity
    max_inflight = 128
    max_awaiting_rel = 200
    max_mqueue_len = 2000
    mqueue_store_qos0 = true
    use_username_as_clientid = false
    ignore_loop_deliver = false
}

listeners.ssl.default = {
  bind = "0.0.0.0:8883"
  max_connections = 512000
  ssl_options {
    keyfile = "etc/certs/key.pem"
    certfile = "etc/certs/cert.pem"
    cacertfile = "etc/certs/cacert.pem"
  }
}

listeners.ws.default = {
  bind = "0.0.0.0:8083"
  max_connections = 1024
  acceptors = 8
  proxy_protocol = false
  websocket = {
    mqtt_path = "/mqtt"
    proxy_address_header = x-forwarded-for
    proxy_port_header = x-forwarded-port
  }
  limiter.connection = {
    rate = "100/s"
    burst = "20"
    initial = 100
    capacity = 500
  }
  access_rules = ["allow all"]
}

listeners.wss.default = {
  bind = "0.0.0.0:8084"
  max_connections = 512000
  websocket.mqtt_path = "/mqtt"
  ssl_options = {
    keyfile = "etc/certs/key.pem"
    certfile = "etc/certs/cert.pem"
    cacertfile = "etc/certs/cacert.pem"
  }
}

# Delayed messages
delayed = {
    enable = true
    max_delayed_messages = 5
}

# Retained messages
retainer = {
    enable = true
    msg_expiry_interval = 3h
    msg_clear_interval = 6h
    backend = {
      storage_type = ram
      max_retained_messages = 5
    }
}

# Log configuration
log = {
    file_handlers.default = {
      enable = true
      level = error
      file = "log/emqx.log"
      sync_mode_qlen = 256
      chars_limit = 16384
      formatter = text
      max_size = 64MB
      rotation.count = 10
      burst_limit = {
        enable = true
        max_count = 20480
        window_time = 3s
      }
      overload_kill = {
        enable = true
        mem_size = 50MB
        qlen = 30000
      }
  }
}

# listeners.quic.default {
#  enabled = true
#  bind = "0.0.0.0:14567"
#  max_connections = 1024000
#  ssl_options {
#   verify = verify_none
#   keyfile = "etc/certs/key.pem"
#   certfile = "etc/certs/cert.pem"
#   cacertfile = "etc/certs/cacert.pem"
#  }
# }

# Dashboard configuration
dashboard = {
    default_username = "admin"
    sample_interval = 10s
    token_expired_time = 60m
    cors = false
    i18n_lang = en
    listeners.http = {
        enable = true
        bind = 38080
        num_acceptors = 8
        # the cloud application needs to use port 38088
        max_connections = 1024
        backlog = 1024
        send_timeout = 10s
    }
}

# Authorization configuration
authorization = {
  deny_action = ignore
  no_match = deny
  sources = [
    {
      type = file
      enable = true
      path = "etc/acl.conf"
    },
    {
      type = http
      method = post
      enable = true
      url = "http://xxxxxx.dev.vpc/auth/EmqxAcl/acl"
      request_timeout = 10s
      connect_timeout = 10s
      enable_pipelining = 237
      pool_size = 8
      body = {
        action = "${action}",
        clientid = "${clientid}",
        from = "emqx5",
        ipaddr = "${peerhost}",
        topic = "${topic}",
        username = "${username}"
      }
      headers = {
         Content-Type = "application/json"
         X-Request-Source = "EMQX"
         accept = "application/json"
         cache-control = "no-cache"
         connection = "keep-alive"
         keep-alive = "timeout=30, max=1000"
      }
    }
  ]
  cache = {
    enable = true
    max_size = 64
    ttl = 15m
  }
}


# Authentication configuration
authentication = [
  {
    backend = http
    method = post
    mechanism = password_based
    enable = true
    request_timeout = 5s
    connect_timeout = 10s
    enable_pipelining = 100
    pool_size = 8
    url = "http://xxxxx.dev.vpc/auth/emqxAuth/auth"
    body = {
      clientid =  "${clientid}"
      from =  "emqx5"
      ipaddr = "${peerhost}"
      password = "${password}"
      username = "${username}"
    }
    headers = {
      Content-Type = "application/json"
      X-Request-Source = "EMQX"
      accept = "application/json"
      cache-control = "no-cache"
      connection = "keep-alive"
      keep-alive = "timeout=30, max=1000"
    }
  }
]

# System monitoring
sysmon = {
  os = {
    cpu_check_interval = 60s
    cpu_high_watermark = 95%
    cpu_low_watermark = 90%
    mem_check_interval = 60s
    sysmem_high_watermark = 80%
    procmem_high_watermark = 5%
  }
  vm = {
    long_gc = disabled
    long_schedule = 240ms
    large_heap = 32MB
    busy_port = false
    busy_dist_port = true
    process_high_watermark = 80%
    process_low_watermark = 70%
  }
}

# Prometheus configuration
prometheus = {
    enable = true
    push_gateway_server = "http://127.0.0.1:9091"
    interval = 15s
}

# Force garbage collection in MQTT connection process after they process certain number of messages or bytes of data.
force_gc = {
    enable = true
    bytes = 16MB
    count = 16000
}

# When the process message queue length, or the memory bytes reaches a certain value, the process is forced to close.
#
# Note: "message queue" here refers to the "message mailbox" of the Erlang process, not the mqueue of QoS 1 and QoS 2.
force_shutdown = {
    enable = true
    max_message_queue_len = 1000
    max_heap_size = 32MB
}

# System event topics
sys_topics = {
    sys_event_messages = {
      client_connected = false
      client_disconnected = false
    }
}

Reproduction steps:

  1. Start 3 core nodes.
  2. Start 10 replica nodes; wait for them to join the cluster successfully.
  3. Shut down the 10 replica nodes; wait for them to be removed from the cluster successfully.
  4. Start 3 replica nodes.
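
Cluster membership at steps 2 and 3 can be confirmed from any running node with the standard CLI check:

./bin/emqx ctl cluster status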

Started with systemd:

[Unit]
Description=emqx daemon
After=network.target

[Service]
User=root
Group=root

# The ExecStart= is foreground, so 'simple' here
Type=simple
Environment=HOME=/home/emqx/_build/emqx/rel/emqx

# Start 'foreground' but not 'start' (daemon) mode.
# Because systemd monitor/restarts 'simple' services
ExecStart=/bin/bash /home/emqx/_build/emqx/rel/emqx/bin/emqx foreground

# Give EMQX enough file descriptors
LimitNOFILE=1048576

# ExecStop is commented out so systemd will send a SIGTERM when 'systemctl stop'.
# SIGTERM is handled by EMQX and it then performs a graceful shutdown
# It's better than command 'emqx stop' because it needs to ping the node
# ExecStop=/bin/bash /usr/bin/emqx stop

# Wait long enough before force kill for graceful shutdown
TimeoutStopSec=120s

Restart=on-failure

# Do not restart immediately so the peer nodes in the cluster have
# enough time to handle the 'DOWN' events of this node
RestartSec=120s

[Install]

@heeejianbo Any updates on this issue?