我配置了集群的discovery_strategy为static,并在seed里使用IP地址配置了6个节点,6个节点分布在2个机房,每个机房有3个节点,我有以下的重启脚本方案:
方案一:
currentDir=$(cd "$(dirname "$0")" 2>/dev/null || exit 1;pwd)
parentDir=$(cd "$(dirname "$currentDir"/../..)" 2>/dev/null || exit 1;pwd)
source "$parentDir"/work/.settings/config.env
source "$parentDir"/work/.settings/node.env
echo "shutting down emqx..."
"$parentDir"/work/bin/emqx stop
# 删除data目录
rm -rf "$parentDir"/work/data
# 设置节点信息
"$currentDir/cluster-node-config.sh"
source "$parentDir"/work/.settings/config.env
source "$parentDir"/work/.settings/node.env
echo "starting emqx...
"$parentDir"/work/bin/emqx start
方案二:
currentDir=$(cd "$(dirname "$0")" 2>/dev/null || exit 1;pwd)
parentDir=$(cd "$(dirname "$currentDir"/../..)" 2>/dev/null || exit 1;pwd)
source "$parentDir"/work/.settings/config.env
source "$parentDir"/work/.settings/node.env
echo "shutting down emqx..."
"$parentDir"/work/bin/emqx stop
# 设置节点信息
"$currentDir/cluster-node-config.sh"
source "$parentDir"/work/.settings/config.env
source "$parentDir"/work/.settings/node.env
echo "starting emqx...
"$parentDir"/work/bin/emqx start
方案三:
currentDir=$(cd "$(dirname "$0")" 2>/dev/null || exit 1;pwd)
parentDir=$(cd "$(dirname "$currentDir"/../..)" 2>/dev/null || exit 1;pwd)
source "$parentDir"/work/.settings/config.env
source "$parentDir"/work/.settings/node.env
# 退出集群
echo "leaving cluster..."
"$parentDir"/work/bin/emqx_ctl cluster leave
echo "shutting down emqx..."
"$parentDir"/work/bin/emqx stop
# 删除data目录
rm -rf "$parentDir"/work/data
# 设置节点信息
"$currentDir/cluster-node-config.sh"
source "$parentDir"/work/.settings/config.env
source "$parentDir"/work/.settings/node.env
echo "starting emqx...
"$parentDir"/work/bin/emqx start
运行结果:
方案一:一个机房的3个节点同时重启时,会有1或2个节点启动失败,日志报错如下:
Generic server ekka_autocluster terminationg.Reason: {bad_return_value,{error,{merge_schema_failed,“Bad cookie in table definition emqx_telemetry: ‘emqx@192.168.1.15’ = {cstruct,emqx_telemetry,set,[‘emqx@192.168.1.31’,‘emqx@192.168.1.15’,‘emqx@192.168.1.12’,‘emqx@192.168.1.13’,‘emqx@192.168.1.32’,‘emqx@192.168.1.34’,0,read_write,false,false,telemetry,[id,uuid],{{1747615888217844927,-57646075203422942,1},'emqx@192.168.1.34},{{7,0},{‘emqx@192.168.23.31’,{1747,616194,653823}}}},‘emqx@192.168.1.12’ = {cstruct,emqx_telemetry,set,[‘emqx@192.168.1.12’],0,read_write,false,false,telemetry,[id,uuid],{{1747616441456448948,-576460752303422429,1},‘emqx@192.168.1.12’},{{2,0},}}\n”}}}.Last message: loop. State: {s,emqx}.
[error] crasher: initial call: ekka_autocluster:init/1, pid:<0.379.0>, registered_name:ekka_autocluster, exit: {{bad_return_value,{error,{merge_schema_failed,“Bad cookie in table definition emqx_telemetry: ‘emqx@192.168.1.15’ = {cstruct,emqx_telemetry,set,[‘emqx@192.168.1.31’,‘emqx@192.168.1.15’,‘emqx@192.168.1.12’,‘emqx@192.168.1.13’,‘emqx@192.168.1.32’,‘emqx@192.168.1.34’],0,read_write,false,false,telemetry,[id,uuid],{{1747615888217844927,-57646075203422942,1},‘emqx@192.168.1.34’,{{7,0},{‘emqx@192.168.1.31’,{1747,616194,653823}}}},‘emqx@192.168.1.12’ = {cstruct,emqx_telemetry,set,[‘emqx@192.168.1.12’],0,read_write,false,false,telemetry,[id,uuid],{{1747616441456448948,-576460752303422429,1},‘emqx@192.168.1.12’},{{2,0},}}\n”}}},[{gen_server,hablde_common_reply,8,[{file,“gen_server.erl”},{line,1230}]},{proc_lib,init_p_do_apply,3,[{file,“proc_lib.erl”},{line,241}]}]},ancestors: [emqx_machine_sup,<0.2307.0>],message_queue_len: 0, messages: , links:, dictionary:[{rand_seed,{#type => exsss,next => #Fun<rand.0.65977474>,bits => 58,uniform => #Fun<rand.1.65977474>,uniform_n => #Fun<rand.2.65977474>,jump => #Fun<rand.3.65977474>},[125485708919800343|233554188845197464]}}], trap_exit: false, status: running, heap_size: 6772, stack_size: 28, reductions:119670; neighbours:
方案二和方案三在三个节点同时启动时,目前的测试结果来看都能启动成功。
请问为什么方案一的重启脚本会报错?
DISCOVERY_STRATEGY为static的配置下,如果多个节点同时启动时,是否会出现集群脑裂的现象?