Nodes in my EMQX cluster keep raising emqx_authn_http-related alarms

My EMQX cluster is deployed on a cloud K8s platform and is managed by emqx-operator-controller-manager.

Here is my YAML file:
apiVersion: apps.emqx.io/v2beta1
kind: EMQX
metadata:
  name: emqx
  namespace: real
spec:
  image: emqx:5.5.0
  coreTemplate:
    spec:
      replicas: 3
      volumeClaimTemplates:
        storageClassName: standard
        resources:
          requests:
            storage: 10Gi
        accessModes:
          - ReadWriteOnce
      extraVolumes:
        - name: ssl-self-sign
          secret:
            secretName: ssl-self-sign
      extraVolumeMounts:
        - name: ssl-self-sign
          mountPath: /mounted/cert
  replicantTemplate:
    spec:
      replicas: 3
      extraVolumes:
        - name: ssl-self-sign
          secret:
            secretName: ssl-self-sign
      extraVolumeMounts:
        - name: ssl-self-sign
          mountPath: /mounted/cert
  dashboardServiceTemplate:
    spec:
      type: NodePort
      ports:
        - name: dashboard
          nodePort: 30811
          port: 18083
          protocol: TCP
          targetPort: 18083
  listenersServiceTemplate:
    spec:
      type: NodePort
      ports:
        - name: ssl-default
          nodePort: 32349
          port: 8883
          protocol: TCP
          targetPort: 8883
        - name: tcp-default
          nodePort: 32350
          port: 1883
          protocol: TCP
          targetPort: 1883
        - name: ws-default
          nodePort: 32347
          port: 8083
          protocol: TCP
          targetPort: 8083
        - name: wss-default
          nodePort: 32348
          port: 8084
          protocol: TCP
          targetPort: 8084
        - name: tcp-program
          nodePort: 32351
          port: 1993
          protocol: TCP
          targetPort: 1993

In the EMQX monitoring alarms I can see multiple messages like the following:
emqx_authn_http:6269953  resource down: closed emqx@10.244.13.254 System 2025-04-17 01:28:22

reason: resource_down
resource_id: emqx_authn_http:11126819


I'd like to know what is causing this.

If you click the question-mark icon next to the alarm you'll get more detailed information.
Generally, if no bug is involved, it means the connection between your HTTP server and EMQX is unstable.

The detailed information is just this:
reason: resource_down
resource_id: emqx_authn_http:11126819
If it is an unstable connection, what could be causing it? The servers are in the same private network in the cloud, so there shouldn't be any latency issues, and resource usage on the servers is not high.

I found the following errors in the log:

2025-04-17T22:34:50.261151+08:00 [error] msg: http_connector_get_status_failed, mfa: emqx_bridge_http_connector:do_get_status/2(535), reason: closed, worker: <0.91569875.0>
2025-04-17T22:34:50.264632+08:00 [warning] msg: alarm_is_activated, mfa: emqx_alarm:do_actions/3(418), message: <<"resource down: closed">>, name: <<"emqx_authn_http:6287081">>
2025-04-17T22:34:50.264876+08:00 [warning] msg: health_check_failed, mfa: emqx_resource_manager:handle_connected_health_check/1(893), id: <<"emqx_authn_http:6287081">>, status: disconnected

15 seconds later there are logs showing the connector starting again:

2025-04-17T22:35:05.265923+08:00 [info] msg: starting_http_connector, mfa: emqx_bridge_http_connector:on_start/2(203), config: #{ssl => #{depth => 10,verify => verify_peer,hibernate_after => 5000,enable => false,ciphers => [],versions => ['tlsv1.3','tlsv1.2'],log_level => notice,secure_renegotiate => true,reuse_sessions => true,user_lookup_fun => {fun emqx_tls_psk:lookup/3,undefined}},connect_timeout => 2000,mechanism => password_based,pool_size => 20,enable => true,body => #{password => <<"******">>,userName => <<"${username}">>},headers => #{},url => <<"http://iot-device:41***/iotDevice/deviceAuth/mqttAuth">>,method => get,backend => http,request_timeout => 5000,pool_type => random,enable_pipelining => 100,base_url => #{port => 41***,scheme => http,path => "/",host => "iot-device"}}, connector: <<"emqx_authn_http:6287081">>
2025-04-17T22:35:05.266193+08:00 [warning] msg: emqx_connector_on_start_already_started, mfa: emqx_bridge_http_connector:start_pool/2(256), pool_name: <<"emqx_authn_http:6287081">>
2025-04-17T22:35:05.270613+08:00 [warning] msg: alarm_is_deactivated, mfa: emqx_alarm:do_actions/3(424), name: <<"emqx_authn_http:6287081">>

Other users have hit this problem as well, and statistically every case so far has come down to the HTTP server implementation not fully supporting pipelining. When 100 requests are sent over the connection at once, some of them occasionally never get a response, and it happens probabilistically.
After a while, it splits into 2 cases:

  1. The connection stops responding entirely but does not close, and becomes a zombie connection.
  2. The connection eventually dies (with some lag), and EMQX then reconnects. This case matches your logs very well.

But most customers are convinced that the HTTP server they implemented cannot possibly be at fault (probably because the failures are probabilistic).
So EMQX 5.8.1 (upgrading to 5.8.6 is recommended) introduced a check that counts these zombie occurrences:

When it detects that a connection to the HTTP server has gone N times without a response, it proactively disconnects and reconnects, and then logs: force_reconnecting_zombie_http_connection

For example, this user hit the typical scenario: force_reconnecting_zombie_http_connection - #4, from jsonyu.
After they reimplemented the HTTP server on FastAPI's async framework, they reported that the problem was gone.
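
If you want to rule pipelining in or out while you investigate, a common mitigation is to lower enable_pipelining (1 effectively disables pipelining, so each request waits for its response before the next one is sent). Below is a minimal HOCON sketch of the HTTP authentication entry, reusing the URL, pool size and timeouts printed in your own logs; the body placeholders and where you apply it (emqx.conf, the dashboard, or the operator's config) are assumptions to adapt to your setup:

authentication = [
  {
    mechanism = password_based
    backend = http
    method = get
    url = "http://iot-device-manager:41021/iotDevice/deviceAuth/mqttAuth"
    body {
      userName = "${username}"
      password = "${password}"    # placeholder; the real value is masked in your log
    }
    pool_size = 20
    connect_timeout = "2s"
    request_timeout = "5s"
    enable_pipelining = 1         # was 100; 1 disables HTTP pipelining
  }
]

If the "resource down: closed" alarms stop after this change, that points strongly at the pipelining explanation above.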

If it is an HTTP pipelining problem, shouldn't it cause head-of-line blocking? My service hasn't shown anything like that, so why is this happening?

That's a good question. If you really want to trace it to the root cause, I'd suggest using tcpdump to find out. I can't explain it.
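
If you do go down the tcpdump route, a capture like the one below on the auth-service side would let you open the pcap in Wireshark and check whether some pipelined requests never receive a response before the connection is closed. The port is taken from the connector config in your logs; the interface and filter are assumptions to adjust for your environment:

# write all traffic on the auth port to a pcap file for offline analysis
tcpdump -i any -s0 -w emqx-authn.pcap 'tcp port 41021'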

Today I looked at the logs again and found some rather odd error entries:

{"time":1744956905766836,"level":"error","msg":"cluster_rpc_peers_lagging","mfa":"emqx_cluster_rpc:multicall/5(162)","tnx_id":52,"nodes":["emqx@10.244.65.29","emqx@10.244.13.206","emqx@10.244.0.52"],"status":"stopped_nodes","pid":"<0.40974330.2>"}
{"time":1744958026480040,"level":"error","msg":"http_connector_get_status_failed","mfa":"emqx_bridge_http_connector:do_get_status/2(535)","worker":"<0.112758912.0>","reason":"closed","pid":"<0.41042704.2>"}
{"time":1744958026483443,"level":"warning","msg":"alarm_is_activated","mfa":"emqx_alarm:do_actions/3(418)","pid":"<0.2683.0>","name":"emqx_authn_http:11126819","message":"resource down: closed"}
{"time":1744958026483811,"level":"warning","msg":"health_check_failed","mfa":"emqx_resource_manager:handle_connected_health_check/1(893)","status":"disconnected","pid":"<0.112761807.0>","id":"emqx_authn_http:11126819"}
{"time":1744958041484693,"level":"info","msg":"starting_http_connector","mfa":"emqx_bridge_http_connector:on_start/2(203)","connector":"emqx_authn_http:11126819","config":{"url":"http://iot-device-manager:41021/iotDevice/deviceAuth/mqttAuth","ssl":{"versions":["tlsv1.3","tlsv1.2"],"verify":"verify_peer","user_lookup_fun":"{fun emqx_tls_psk:lookup/3,undefined}","secure_renegotiate":"true","reuse_sessions":"true","log_level":"notice","hibernate_after":5000,"enable":"false","depth":10,"ciphers":[]},"request_timeout":5000,"pool_type":"random","pool_size":20,"method":"get","mechanism":"password_based","headers":{},"enable_pipelining":100,"enable":"true","connect_timeout":2000,"body":{"userName":"${username}","password":"******"},"base_url":{"scheme":"http","port":41021,"path":"/","host":"iot-device"},"backend":"http"},"pid":"<0.112761807.0>"}
{"time":1744958041485165,"level":"warning","msg":"emqx_connector_on_start_already_started","mfa":"emqx_bridge_http_connector:start_pool/2(256)","pool_name":"emqx_authn_http:11126819","pid":"<0.112761807.0>"}
{"time":1744958041489921,"level":"warning","msg":"alarm_is_deactivated","mfa":"emqx_alarm:do_actions/3(424)","pid":"<0.2683.0>","name":"emqx_authn_http:11126819"}

The three nodes in the first entry don't exist, and my EMQX cluster has already been running for 4 months.

Most likely one of your nodes restarted, and after the restart its IP changed. The new node then tries to sync data from the old nodes, finds that they no longer exist, and keeps waiting for them to come back.

I recommend making 2 changes:

  1. Use a hostname rather than an IP as the node name (emqx@prod.dev instead of emqx@10.244.13.206),
    as described under "use a hostname" in the docs: Installation and Deployment FAQ | EMQX Docs. There is a sketch of what this looks like after this list.

  2. Upgrade to a newer version, 5.8.6, which fixes a bug where data was not cleaned up after old nodes left the cluster.
    EMQX Open Source v5 Releases | EMQX Docs

  • #12843 Fixed the cluster_rpc_commit transaction ID cleanup procedure on replicant nodes after the emqx ctl cluster leave command is run. Previously, failure to clear these transaction IDs properly blocked configuration updates on core nodes.
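
For reference, the hostname recommendation in item 1 boils down to something like the following in emqx.conf for a manually managed node. The FQDN here is a hypothetical headless-service name, not taken from your cluster (with the operator, core nodes normally get names like this automatically):

node {
  # a stable, resolvable DNS name instead of a pod IP
  name = "emqx@emqx-core-0.emqx-headless.real.svc.cluster.local"
}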

The cluster had been running for four months before this suddenly appeared, and the core nodes have no record of restarting. Also, the cluster is managed by emqx-operator-controller-manager, which doesn't seem to have a hostname setting.

I'd suggest asking on the operator's GitHub; they should know the exact steps.

If you're using the operator, its core nodes should already be using hostnames, and those IP-named nodes should be replicants, which don't store data anyway. But I'm not familiar with how the operator works, so you'd better ask on GitHub.
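
One way to check which nodes use hostnames and which use IPs is to run the cluster status command inside a core pod; the pod name below is a placeholder (use kubectl get pods -n real to find yours). The stopped_nodes it reports should line up with the stale IP-named entries from your cluster_rpc_peers_lagging log:

kubectl exec -n real <your-core-pod> -- emqx ctl cluster status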