Nodes in my EMQX cluster keep raising emqx_authn_http-related alarms

My EMQX cluster is deployed on a cloud K8s platform and is managed by emqx-operator-controller-manager.

Here is my YAML file:
apiVersion: apps.emqx.io/v2beta1
kind: EMQX
metadata:
  name: emqx
  namespace: real
spec:
  image: emqx:5.5.0
  coreTemplate:
    spec:
      replicas: 3
      volumeClaimTemplates:
        storageClassName: standard
        resources:
          requests:
            storage: 10Gi
        accessModes:
          - ReadWriteOnce
      extraVolumes:
        - name: ssl-self-sign
          secret:
            secretName: ssl-self-sign
      extraVolumeMounts:
        - name: ssl-self-sign
          mountPath: /mounted/cert
  replicantTemplate:
    spec:
      replicas: 3
      extraVolumes:
        - name: ssl-self-sign
          secret:
            secretName: ssl-self-sign
      extraVolumeMounts:
        - name: ssl-self-sign
          mountPath: /mounted/cert
  dashboardServiceTemplate:
    spec:
      type: NodePort
      ports:
        - name: dashboard
          nodePort: 30811
          port: 18083
          protocol: TCP
          targetPort: 18083
  listenersServiceTemplate:
    spec:
      type: NodePort
      ports:
        - name: ssl-default
          nodePort: 32349
          port: 8883
          protocol: TCP
          targetPort: 8883
        - name: tcp-default
          nodePort: 32350
          port: 1883
          protocol: TCP
          targetPort: 1883
        - name: ws-default
          nodePort: 32347
          port: 8083
          protocol: TCP
          targetPort: 8083
        - name: wss-default
          nodePort: 32348
          port: 8084
          protocol: TCP
          targetPort: 8084
        - name: tcp-program
          nodePort: 32351
          port: 1993
          protocol: TCP
          targetPort: 1993

In the EMQX monitoring alarms I can see multiple messages like the following:
emqx_authn_http:6269953  resource down: closed emqx@10.244.13.254 System 2025-04-17 01:28:22

reason: resource_down
resource_id: emqx_authn_http:11126819


I'd like to know what is causing this.

If you click the question-mark icon next to the alarm you'll get more detailed information.
Generally, if no bug is involved, it means the connection between your HTTP server and EMQX is unstable.

The detailed information is just this:
reason: resource_down
resource_id: emqx_authn_http:11126819
If it is an unstable connection, what could be causing it? The servers are in the same private network in the cloud, so there shouldn't be any latency issues, and resource usage on the servers is not high.

I found the following errors in the log:

2025-04-17T22:34:50.261151+08:00 [error] msg: http_connector_get_status_failed, mfa: emqx_bridge_http_connector:do_get_status/2(535), reason: closed, worker: <0.91569875.0>
2025-04-17T22:34:50.264632+08:00 [warning] msg: alarm_is_activated, mfa: emqx_alarm:do_actions/3(418), message: <<"resource down: closed">>, name: <<"emqx_authn_http:6287081">>
2025-04-17T22:34:50.264876+08:00 [warning] msg: health_check_failed, mfa: emqx_resource_manager:handle_connected_health_check/1(893), id: <<"emqx_authn_http:6287081">>, status: disconnected

15 seconds later there are logs showing the connector starting again:

2025-04-17T22:35:05.265923+08:00 [info] msg: starting_http_connector, mfa: emqx_bridge_http_connector:on_start/2(203), config: #{ssl => #{depth => 10,verify => verify_peer,hibernate_after => 5000,enable => false,ciphers => [],versions => ['tlsv1.3','tlsv1.2'],log_level => notice,secure_renegotiate => true,reuse_sessions => true,user_lookup_fun => {fun emqx_tls_psk:lookup/3,undefined}},connect_timeout => 2000,mechanism => password_based,pool_size => 20,enable => true,body => #{password => <<"******">>,userName => <<"${username}">>},headers => #{},url => <<"http://iot-device:41***/iotDevice/deviceAuth/mqttAuth">>,method => get,backend => http,request_timeout => 5000,pool_type => random,enable_pipelining => 100,base_url => #{port => 41***,scheme => http,path => "/",host => "iot-device"}}, connector: <<"emqx_authn_http:6287081">>
2025-04-17T22:35:05.266193+08:00 [warning] msg: emqx_connector_on_start_already_started, mfa: emqx_bridge_http_connector:start_pool/2(256), pool_name: <<"emqx_authn_http:6287081">>
2025-04-17T22:35:05.270613+08:00 [warning] msg: alarm_is_deactivated, mfa: emqx_alarm:do_actions/3(424), name: <<"emqx_authn_http:6287081">>

Other users have hit this problem as well, and statistically every case so far has come down to the HTTP server implementation not fully supporting pipelining. When 100 requests are sent over the connection at once, some of them occasionally never get a response, and it happens probabilistically.
After a while, it splits into 2 cases:

  1. The connection stops responding entirely but does not close, and becomes a zombie connection.
  2. The connection eventually dies (with some lag), and EMQX then reconnects. This case matches your logs very well.

But most customers are convinced that the HTTP server they implemented cannot possibly be at fault (probably because the failures are probabilistic).
So EMQX 5.8.1 (upgrading to 5.8.6 is recommended) introduced a check that counts these zombie occurrences:

When it detects that a connection to the HTTP server has gone N times without a response, it proactively disconnects and reconnects, and then logs: force_reconnecting_zombie_http_connection

For example, this user hit the typical scenario: force_reconnecting_zombie_http_connection - #4, from jsonyu.
After they reimplemented the HTTP server on FastAPI's async framework, they reported that the problem was gone.
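
If you want to rule pipelining in or out while you investigate, a common mitigation is to lower enable_pipelining (1 effectively disables pipelining, so each request waits for its response before the next one is sent). Below is a minimal HOCON sketch of the HTTP authentication entry, reusing the URL, pool size and timeouts printed in your own logs; the body placeholders and where you apply it (emqx.conf, the dashboard, or the operator's config) are assumptions to adapt to your setup:

authentication = [
  {
    mechanism = password_based
    backend = http
    method = get
    url = "http://iot-device-manager:41021/iotDevice/deviceAuth/mqttAuth"
    body {
      userName = "${username}"
      password = "${password}"    # placeholder; the real value is masked in your log
    }
    pool_size = 20
    connect_timeout = "2s"
    request_timeout = "5s"
    enable_pipelining = 1         # was 100; 1 disables HTTP pipelining
  }
]

If the "resource down: closed" alarms stop after this change, that points strongly at the pipelining explanation above.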

If it is an HTTP pipelining problem, shouldn't it cause head-of-line blocking? My service hasn't shown anything like that, so why is this happening?

That's a good question. If you really want to trace it to the root cause, I'd suggest using tcpdump to find out. I can't explain it.
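
If you do go down the tcpdump route, a capture like the one below on the auth-service side would let you open the pcap in Wireshark and check whether some pipelined requests never receive a response before the connection is closed. The port is taken from the connector config in your logs; the interface and filter are assumptions to adjust for your environment:

# write all traffic on the auth port to a pcap file for offline analysis
tcpdump -i any -s0 -w emqx-authn.pcap 'tcp port 41021'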

Today I looked at the logs again and found some rather odd error entries:

{"time":1744956905766836,"level":"error","msg":"cluster_rpc_peers_lagging","mfa":"emqx_cluster_rpc:multicall/5(162)","tnx_id":52,"nodes":["emqx@10.244.65.29","emqx@10.244.13.206","emqx@10.244.0.52"],"status":"stopped_nodes","pid":"<0.40974330.2>"}
{"time":1744958026480040,"level":"error","msg":"http_connector_get_status_failed","mfa":"emqx_bridge_http_connector:do_get_status/2(535)","worker":"<0.112758912.0>","reason":"closed","pid":"<0.41042704.2>"}
{"time":1744958026483443,"level":"warning","msg":"alarm_is_activated","mfa":"emqx_alarm:do_actions/3(418)","pid":"<0.2683.0>","name":"emqx_authn_http:11126819","message":"resource down: closed"}
{"time":1744958026483811,"level":"warning","msg":"health_check_failed","mfa":"emqx_resource_manager:handle_connected_health_check/1(893)","status":"disconnected","pid":"<0.112761807.0>","id":"emqx_authn_http:11126819"}
{"time":1744958041484693,"level":"info","msg":"starting_http_connector","mfa":"emqx_bridge_http_connector:on_start/2(203)","connector":"emqx_authn_http:11126819","config":{"url":"http://iot-device-manager:41021/iotDevice/deviceAuth/mqttAuth","ssl":{"versions":["tlsv1.3","tlsv1.2"],"verify":"verify_peer","user_lookup_fun":"{fun emqx_tls_psk:lookup/3,undefined}","secure_renegotiate":"true","reuse_sessions":"true","log_level":"notice","hibernate_after":5000,"enable":"false","depth":10,"ciphers":[]},"request_timeout":5000,"pool_type":"random","pool_size":20,"method":"get","mechanism":"password_based","headers":{},"enable_pipelining":100,"enable":"true","connect_timeout":2000,"body":{"userName":"${username}","password":"******"},"base_url":{"scheme":"http","port":41021,"path":"/","host":"iot-device"},"backend":"http"},"pid":"<0.112761807.0>"}
{"time":1744958041485165,"level":"warning","msg":"emqx_connector_on_start_already_started","mfa":"emqx_bridge_http_connector:start_pool/2(256)","pool_name":"emqx_authn_http:11126819","pid":"<0.112761807.0>"}
{"time":1744958041489921,"level":"warning","msg":"alarm_is_deactivated","mfa":"emqx_alarm:do_actions/3(424)","pid":"<0.2683.0>","name":"emqx_authn_http:11126819"}

The three nodes in the first entry don't exist, and my EMQX cluster has already been running for 4 months.

Most likely one of your nodes restarted, and after the restart its IP changed. The new node then tries to sync data from the old nodes, finds that they no longer exist, and keeps waiting for them to come back.

I recommend making 2 changes:

  1. Use a hostname rather than an IP as the node name (emqx@prod.dev instead of emqx@10.244.13.206),
    as described under "use a hostname" in the docs: Installation and Deployment FAQ | EMQX Docs. There is a sketch of what this looks like after this list.

  2. Upgrade to a newer version, 5.8.6, which fixes a bug where data was not cleaned up after old nodes left the cluster.
    EMQX Open Source v5 Releases | EMQX Docs

  • #12843 Fixed the cluster_rpc_commit transaction ID cleanup procedure on replicant nodes after the emqx ctl cluster leave command is run. Previously, failure to clear these transaction IDs properly blocked configuration updates on core nodes.
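
For reference, the hostname recommendation in item 1 boils down to something like the following in emqx.conf for a manually managed node. The FQDN here is a hypothetical headless-service name, not taken from your cluster (with the operator, core nodes normally get names like this automatically):

node {
  # a stable, resolvable DNS name instead of a pod IP
  name = "emqx@emqx-core-0.emqx-headless.real.svc.cluster.local"
}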

The cluster had been running for four months before this suddenly appeared, and the core nodes have no record of restarting. Also, the cluster is managed by emqx-operator-controller-manager, which doesn't seem to have a hostname setting.

I'd suggest asking on the operator's GitHub; they should know the exact steps.

If you're using the operator, its core nodes should already be using hostnames, and those IP-named nodes should be replicants, which don't store data anyway. But I'm not familiar with how the operator works, so you'd better ask on GitHub.
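
One way to check which nodes use hostnames and which use IPs is to run the cluster status command inside a core pod; the pod name below is a placeholder (use kubectl get pods -n real to find yours). The stopped_nodes it reports should line up with the stale IP-named entries from your cluster_rpc_peers_lagging log:

kubectl exec -n real <your-core-pod> -- emqx ctl cluster status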