docker emqx cluster join timeout 如何排查

环境信息

  • EMQX 版本:4.0.0
  • 操作系统及版本:CentOS Linux release 7.2.1511 (Core)
  • 其他: docker 19.03.8

问题描述

跑了几个月的emqx(etcd) docker (net=host) 集群今天被oom kill 了 ,重启的该oom kill 的节点时候无法加入集群

配置文件及日志

2022-10-15 16:23:17.242 [info] application: emqx_recon
started_at: ‘emqx4@172.17.110.215’
2022-10-15 16:23:17.242 [info] [Plugins] Started plugins: [emqx_recon]
2022-10-15 16:23:17.242 [info] [Plugins] Load plugin emqx_recon successfully
[‘2022-10-15T16:23:17Z’]:waiting emqx
[‘2022-10-15T16:23:19Z’]:waiting emqx
[‘2022-10-15T16:23:20Z’]:waiting emqx
[‘2022-10-15T16:23:22Z’]:waiting emqx
[‘2022-10-15T16:23:24Z’]:waiting emqx
[‘2022-10-15T16:23:25Z’]:waiting emqx
[‘2022-10-15T16:23:25Z’]:timeout error

可以把data目录的都删,再启动。
PS:推荐使用4.4.x的最新版

删除容器后重新启动也一样加入不到集群

你给的信息太少了,不好定位啊

有点神奇, 重启启动又加入集群了

打开debug,发一下emqx的日志。才能定位。

又出现了 etcd 做注册发现:

EMQ X Broker v4.0.0 is started successfully!
              {mfargs,{prometheus_histogram,start_link,[]}},
              {restart_type,permanent},
              {shutdown,5000},
              {child_type,worker}]
2022-10-27 03:19:52.628 [info]     application: prometheus
    started_at: 'emqx5@172.17.110.217'
2022-10-27 03:19:52.636 [info]     application: luerl
    started_at: 'emqx5@172.17.110.217'
2022-10-27 03:19:52.637 [info]     application: xmerl
    started_at: 'emqx5@172.17.110.217'
2022-10-27 03:19:53.624 [info] Ekka(Membership): Node emqx3@172.17.110.204 up
2022-10-27 03:19:53.627 [info] Ekka(Membership): Node emqx2@172.17.110.203 up
2022-10-27 03:19:53.630 [info] Ekka(Membership): Node emqx4@172.17.110.215 up
2022-10-27 03:19:53.631 [critical] [EMQ X] emqx shutdown for join
2022-10-27 03:19:53.636 [notice]     application: emqx_management
    exited: stopped
    type: temporary
2022-10-27 03:19:53.636 [info] [Plugins] Stop plugin emqx_management successfully
2022-10-27 03:19:53.640 [notice]     application: emqx_recon
    exited: stopped
    type: temporary
2022-10-27 03:19:53.640 [info] [Plugins] Stop plugin emqx_recon successfully
2022-10-27 03:19:53.643 [notice]     application: emqx_retainer
    exited: stopped
    type: temporary
2022-10-27 03:19:53.644 [info] [Plugins] Stop plugin emqx_retainer successfully
2022-10-27 03:19:53.647 [notice]     application: emqx_dashboard
    exited: stopped
    type: temporary
2022-10-27 03:19:53.648 [info] [Plugins] Stop plugin emqx_dashboard successfully
2022-10-27 03:19:53.651 [notice]     application: emqx_auth_username
    exited: stopped
    type: temporary
2022-10-27 03:19:53.651 [info] [Plugins] Stop plugin emqx_auth_username successfully
2022-10-27 03:19:53.655 [notice]     application: emqx_auth_redis
    exited: stopped
    type: temporary
2022-10-27 03:19:53.655 [info] [Plugins] Stop plugin emqx_auth_redis successfully
2022-10-27 03:19:53.659 [notice]     application: emqx_web_hook
    exited: stopped
    type: temporary
2022-10-27 03:19:53.659 [info] [Plugins] Stop plugin emqx_web_hook successfully
2022-10-27 03:19:53.660 [notice] mqtt:tcp stopped on 127.0.0.1:11883

2022-10-27 03:19:53.661 [notice] mqtt:tcp stopped on 0.0.0.0:1883

2022-10-27 03:19:53.662 [notice] mqtt:ssl stopped on 0.0.0.0:8883

2022-10-27 03:19:53.666 [notice]     application: emqx
    exited: stopped
    type: permanent
2022-10-27 03:19:53.668 [notice]     application: ekka
    exited: stopped
    type: temporary
2022-10-27 03:19:53.671 [notice]     application: cowboy
    exited: stopped
    type: permanent
2022-10-27 03:19:53.674 [notice]     application: ranch
    exited: stopped
    type: permanent
2022-10-27 03:19:53.677 [notice]     application: esockd
    exited: stopped
    type: permanent
2022-10-27 03:19:53.679 [notice]     application: gproc
    exited: stopped
    type: permanent
2022-10-27 03:19:53.685 [notice]     application: mnesia
    exited: stopped
    type: temporary
2022-10-27 03:19:53.687 [info]     supervisor: {local,mnesia_sup}
    started: [{pid,<0.1860.0>},
              {id,mnesia_event},
              {mfargs,{mnesia_sup,start_event,[]}},
              {restart_type,permanent},
              {shutdown,30000},
              {child_type,worker}]
2022-10-27 03:19:53.688 [info]     supervisor: {local,mnesia_sup}
    started: [{pid,<0.1861.0>},
              {id,mnesia_ext_sup},
              {mfargs,{mnesia_ext_sup,start,[]}},
              {restart_type,permanent},
              {shutdown,infinity},
              {child_type,supervisor}]
2022-10-27 03:19:53.689 [info]     supervisor: {local,mnesia_kernel_sup}
    started: [{pid,<0.1863.0>},
              {id,mnesia_monitor},
              {mfargs,{mnesia_monitor,start,[]}},
              {restart_type,permanent},
              {shutdown,3000},
              {child_type,worker}]
2022-10-27 03:19:53.689 [info]     supervisor: {local,mnesia_kernel_sup}
    started: [{pid,<0.1864.0>},
              {id,mnesia_subscr},
              {mfargs,{mnesia_subscr,start,[]}},
              {restart_type,permanent},
              {shutdown,3000},
              {child_type,worker}]
2022-10-27 03:19:53.689 [info]     supervisor: {local,mnesia_kernel_sup}
    started: [{pid,<0.1865.0>},
              {id,mnesia_locker},
              {mfargs,{mnesia_locker,start,[]}},
              {restart_type,permanent},
              {shutdown,3000},
              {child_type,worker}]
2022-10-27 03:19:53.689 [info]     supervisor: {local,mnesia_kernel_sup}
    started: [{pid,<0.1866.0>},
              {id,mnesia_recover},
              {mfargs,{mnesia_recover,start,[]}},
              {restart_type,permanent},
              {shutdown,180000},
              {child_type,worker}]
2022-10-27 03:19:53.689 [info]     supervisor: {local,mnesia_kernel_sup}
    started: [{pid,<0.1867.0>},
              {id,mnesia_tm},
              {mfargs,{mnesia_tm,start,[]}},
              {restart_type,permanent},
              {shutdown,30000},
              {child_type,worker}]
2022-10-27 03:19:53.689 [info]     supervisor: {local,mnesia_kernel_sup}
    started: [{pid,<0.1868.0>},
              {id,mnesia_checkpoint_sup},
              {mfargs,{mnesia_checkpoint_sup,start,[]}},
              {restart_type,permanent},
              {shutdown,infinity},
              {child_type,supervisor}]
2022-10-27 03:19:53.690 [info]     supervisor: {local,mnesia_kernel_sup}
    started: [{pid,<0.1869.0>},
              {id,mnesia_controller},
              {mfargs,{mnesia_controller,start,[]}},
              {restart_type,permanent},
              {shutdown,3000},
              {child_type,worker}]
2022-10-27 03:19:53.690 [info]     supervisor: {local,mnesia_kernel_sup}
    started: [{pid,<0.1870.0>},
              {id,mnesia_late_loader},
              {mfargs,{mnesia_late_loader,start,[]}},
              {restart_type,permanent},
              {shutdown,3000},
              {child_type,worker}]
2022-10-27 03:19:53.690 [info]     supervisor: {local,mnesia_sup}
    started: [{pid,<0.1862.0>},
              {id,mnesia_kernel_sup},
              {mfargs,{mnesia_kernel_sup,start,[]}},
              {restart_type,permanent},
              {shutdown,infinity},
              {child_type,supervisor}]
2022-10-27 03:19:53.690 [info]     application: mnesia
    started_at: 'emqx5@172.17.110.217'
['2022-10-27T03:19:54Z']:waiting emqx
['2022-10-27T03:19:56Z']:waiting emqx
['2022-10-27T03:19:58Z']:waiting emqx
['2022-10-27T03:20:00Z']:waiting emqx
['2022-10-27T03:20:02Z']:waiting emqx
['2022-10-27T03:20:04Z']:waiting emqx
['2022-10-27T03:20:04Z']:timeout error

emqx 没起来,如果是停掉了 2 个以上的节点出现的这个问题,怀疑是在等待别的节点启动。

  1. 启动 emqx 节点是有顺序的,要保证最后停掉的节点先启动。或者说把所有节点一起启动。
  2. 重启前后不要更改各个节点的 node name
  3. 推荐使用 emqx 4.4.10,如果出现这种情况,日志里会有 “still waiting for table” 打印,能够定位到问题。

一共4个节点, 其他3个节点都跑了16个月以上目前都是正常状态 , 日志没有still waiting for table 打印, 还有其他的排查办法吗 ? 之前这个节点也出现过一次, 过几天重启启动又好了.


emqx3
191ff2e7d474  emqx/emqx:v4.0.0 17 months ago   Up 16 months                                                                                                                          

emqx2
e5dbb3e8d7a9   emqx/emqx:v4.0.0   17 months ago       Up 16 months

emqx4
44ca6a7cf829   emqx/emqx:v4.0.0   17 months ago       Up 16 months

这个打印是 4.3 以后才有的,所以我说你最好用 4.4.10

你启动节点一直 waiting 的时候,attach 进去那个容器,看看什么情况。
进入容器之后, emqx remote_console 然后 rp(application:which_applications()).

这样吗 ? 好像报错 :smiling_face_with_tear:

Erlang/OTP 22 [erts-10.5.6] [source] [64-bit] [smp:4:4] [ds:4:4:10] [async-threads:1]

*** ERROR: Shell process terminated! (^G to start new job) ***

User switch command
 -->
 -->
 --> rp(application:which_applications()).
Unknown command

你进入 remote_console 之后,直接敲命令。不要 Ctrl+C 或者 Ctrl + G,那样就退出来了。

现在emqx5 手动启动竟然成功了 ,目前集群只有emqx4 起不来

Cluster status: #{running_nodes =>
                      ['emqx2@172.17.110.203','emqx3@172.17.110.204',
                       'emqx5@172.17.110.217'],
                  stopped_nodes => []}

这里根本没办法输入

/opt/emqx $ emqx remote_console
Erlang/OTP 22 [erts-10.5.6] [source] [64-bit] [smp:8:8] [ds:8:8:10] [async-threads:1]

*** ERROR: Shell process terminated! (^G to start new job) ***

BREAK: (a)bort (c)ontinue (p)roc info (i)nfo (l)oaded
       (v)ersion (k)ill (D)b-tables (d)istribution
/opt/emqx $ emqx remote_console
rp()^[[DaErlang/OTP 22 [erts-10.5.6] [source] [64-bit] [smp:8:8] [ds:8:8:10] [async-threads:1]

*** ERROR: Shell process terminated! (^G to start new job) ***

加个微信吧?18253232330

加了大佬

节点又退出了: docker 日志

{"log":"sh: invalid number '111444emqx'\n","stream":"stderr","time":"2022-10-28T11:33:12.742691595Z"}

[webedit@java3 ~]$

大佬解决了吗?我的也出现这个问题 sh: invalid number ‘152447emqx’