EMQX 5.1.1 cluster cannot heal after a split-brain lasting more than 105 seconds, but 5.0.17 heals after a split-brain

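5779170 · September 13, 2023, 01:54

In my tests, an EMQX 5.1.1 cluster cannot heal once a split-brain has lasted more than 105 seconds.
After downgrading to 5.0.17, the cluster does heal after a split-brain. Is this a bug in 5.1.1?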

zhongwencool · September 13, 2023, 02:56

Hello, could you please provide the relevant error logs from 5.1.1?

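5779170 · September 13, 2023, 03:02

This reproduces essentially 100% of the time. For example, with 5 nodes, disconnect one node from the network for more than 2 minutes and the problem almost always appears.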

zhongwencool · September 13, 2023, 03:35

Do you have the relevant logs?

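zhongwencool · September 13, 2023, 03:53

Could you share the following:
How is the cluster configured? Is it all core nodes, or core + replicant?
What is the cluster discovery configuration: a static cluster, or etcd/dns?
Ideally, please upload your emqx.conf file so that we can reproduce the problem.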

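zhongwencool · September 13, 2023, 05:55

Hello, we have a similar issue reported on GitHub:
https://github.com/emqx/emqx/issues/11593
Do you see the same error in your environment?
We cannot reproduce it on our side for now. Could you start the nodes with EMQX_NODE__dist_net_ticktime=10s
and ERL_FLAGS="-kernel prevent_overlapping_partitions false" and test again?
You are also welcome to join the discussion on GitHub.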
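As a rough illustration of what a 10-second tick time means for failure detection, here is a minimal sketch based on the formula in the Erlang `kernel` documentation for `net_ticktime`: an unresponsive node is considered down after somewhere between `TickTime - TickTime / NetTickIntensity` and `TickTime + TickTime / NetTickIntensity` seconds, with `net_tickintensity` defaulting to 4. The function name `detection_window` is hypothetical, and the sketch assumes that `dist_net_ticktime` maps directly to the kernel `net_ticktime` parameter.

```python
from typing import Tuple


def detection_window(ticktime_s: float, tick_intensity: int = 4) -> Tuple[float, float]:
    """Return the (min, max) number of seconds after which an unresponsive
    node is considered down, per the Erlang kernel docs for net_ticktime.

    tick_intensity mirrors the kernel parameter net_tickintensity (default 4).
    """
    if ticktime_s <= 0 or tick_intensity <= 0:
        raise ValueError("ticktime_s and tick_intensity must be positive")
    delta = ticktime_s / tick_intensity
    return ticktime_s - delta, ticktime_s + delta


if __name__ == "__main__":
    # 10 s is the value suggested in this thread; 60 s is the Erlang default.
    for ticktime in (10, 60):
        low, high = detection_window(ticktime)
        print(f"net_ticktime={ticktime}s: node considered down after {low}-{high}s")
```

With the suggested 10s setting, a disconnected node should be noticed after roughly 7.5 to 12.5 seconds, which makes a partition test much quicker to run than with a longer tick time.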

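5779170 · September 13, 2023, 10:04

However, after upgrading to 5.2.0, I found that the split-brain heals automatically even after a 10-minute network disconnection, so it seems to be version-related.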

5779170 · September 13, 2023, 10:07

All core nodes, static cluster.

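zhongwencool · September 14, 2023, 00:47

OK, thank you very much. We have found the cause.
It is already partially fixed in v5.2.0: https://github.com/emqx/mria/pull/158
Another PR was submitted yesterday: https://github.com/emqx/emqx/pull/11595
The complete fix should land in 5.2.1.
Thanks again.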

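5779170 · September 14, 2023, 01:44

When you say "partially fixed", does that mean problems still remain? We are planning to upgrade and are worried about running into other unknown split-brain issues.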

zhongwencool · September 14, 2023, 02:13

I recommend upgrading to v5.2.1 (not yet released).
v5.2.0 may still have problems in extreme cases.

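5779170 · September 14, 2023, 02:34

Could you describe which extreme cases you mean? I can test them on my side as well, because we currently need a stable version to keep our project moving.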

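zhongwencool · September 14, 2023, 02:58

We have not verified the specific scenarios yet. The cause is that Erlang/OTP 25 adopted a more aggressive approach to handling network partitions, so the `global` component that EMQX uses no longer behaves as it did under OTP 24, which we used previously: even a small amount of network jitter may lead `global` to conclude that a split-brain has occurred. In our latest PR we therefore restored this behavior to the same mechanism as in OTP 24.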

github.com/emqx/emqx

fix(distribution): Set prevent_overlapping_partitions to false

emqx:release-52 ← ieQu1:dev/dont-prevent-overlapping-partitions

opened 03:49PM - 12 Sep 23 UTC

ieQu1

+13 -0

Fixes EMQX-10966

Summary (generated by Copilot at 1a86c21): Add a new `prevent_overlapping_partitions` option to the `kernel` schema and its description to the i18n file. This option allows users to configure advanced Erlang options for EMQ X.

For details on the OTP 25 change, see:
https://www.erlang.org/doc/man/global.html#description

zhangguochao · September 14, 2023, 03:01

@zhongwencool Could the issue in the thread "emqx 5.1.6 cluster split-brain" also be caused by this?

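zhongwencool · September 14, 2023, 03:12

I am not sure. I have not seen complete logs of what triggered the split-brain; the logs in that thread all show how the cluster behaved after the split-brain had already happened.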