protocol_error导致的设备离线问题排查

环境

  • EMQX 版本:4.4.19
  • 操作系统版本:ubuntu22.04

重现此问题的步骤

4G直连设备定时上报数据,上报频率大约为30min/组

预期行为

设备正常在线稳定上报数据

实际行为

  1. 大约每两小时出现一次离线情况
  2. webhook监听设备离线回调的reason为protocol_error
  3. 进一步log trace,日志如下:
    2024-09-24T11:00:25.846013+08:00 [debug] 128.244.12.0:14421 [MQTT] SEND PUBACK(Q0, R0, D0, PacketId=19, ReasonCode=16)
    2024-09-24T11:00:25.846144+08:00 [debug] 128.244.12.0:14421 [MQTT] RECV <<16,70,0,4,77,81,84,84,4,192,0,120,0,6,50,48,53,49,55,51,0,12,70,71,53,50,56,48,48,56,56,49,49,57,0,36,100,52,56,99,53,51,97,55,45,49,99,101,52,45,52,52,57,102,45,57,57,102,49,45,53,55,52,97,98,55,98,48,55,52,50,50>>
    2024-09-24T11:00:25.846202+08:00 [debug] 128.244.12.0:14421 [MQTT] RECV CONNECT(Q0, R0, D0, ClientId=205173, ProtoName=MQTT, ProtoVsn=4, CleanStart=false, KeepAlive=120, Username=xxx, Password=)
    2024-09-24T11:00:25.846260+08:00 [debug] 128.244.12.0:14421 [MQTT] Force to close the socket due to protocol_error
    2024-09-24T11:00:25.846356+08:00 [debug] 128.244.12.0:14421 [WebHook] Send to: “/iot/interactive/api/v1/emqxcallback”, params: {“username":“xxx”,“reason”:“protocol_error”,“node”:"mdemqxv2-659677b747-jcqlj@128.244.3.65”,“disconnected_at”:1727146825846,“connected_at”:1727139685549,“clientid”:“xxx”,“action”:“client_disconnected”}
    2024-09-24T11:00:25.847792+08:00 [debug] 128.244.12.0:14421 [MQTT] RECV <<16,70,0,4,77,81,84,84,4,192,0,120,0,6,50,48,53,49,55,51,0,12,70,71,53,50,56,48,48,56,56,49,49,57,0,36,100,52,56,99,53,51,97,55,45,49,99,101,52,45,52,52,57,102,45,57,57,102,49,45,53,55,52,97,98,55,98,48,55,52,50,50>>
    2024-09-24T11:00:25.847898+08:00 [debug] 128.244.12.0:14421 [MQTT] RECV CONNECT(Q0, R0, D0, ClientId=xxx, ProtoName=MQTT, ProtoVsn=4, CleanStart=false, KeepAlive=120, Username=xxx, Password=
    )
    2024-09-24T11:00:29.132230+08:00 [debug] 128.244.12.0:14421 [Auth http] Auth succeeded from path: /iotauth/api/v1/MdIotAuth, username: xxx
    2024-09-24T11:00:29.133175+08:00 [debug] 128.244.12.0:14421 [Channel] RECV CONNECT(Q0, R0, D0, ClientId=205173, ProtoName=MQTT, ProtoVsn=4, CleanStart=false, KeepAlive=120, Username=xxx, Password=)
    2024-09-24T11:00:44.133512+08:00 [warning] 128.244.12.0:14421 action: {takeover,‘begin’}
    msg: session_stepdown_request_timeout
    stale_channel: [{status,running},
    {message_queue_len,3},
    {current_stacktrace,
    [{emqx_cm,stale_channel_info,1,
    [{file,“emqx_cm.erl”},{line,389}]},
    {emqx_cm,request_stepdown,3,
    [{file,“emqx_cm.erl”},{line,362}]},
    {emqx_cm,takeover_session,2,
    [{file,“emqx_cm.erl”},{line,305}]},
    {emqx_cm,‘-open_session/3-fun-2-’,5,
    [{file,“emqx_cm.erl”},{line,258}]},
    {emqx_cm_locker,trans,3,
    [{file,“emqx_cm_locker.erl”},{line,55}]},
    {emqx_channel,process_connect,2,
    [{file,“emqx_channel.erl”},{line,535}]},
    {emqx_connection,with_channel,3,
    [{file,“emqx_connection.erl”},{line,700}]},
    {emqx_connection,process_msg,2,
    [{file,“emqx_connection.erl”},{line,403}]},
    {emqx_connection,process_msg,2,
    [{file,“emqx_connection.erl”},{line,409}]},
    {emqx_connection,handle_recv,3,
    [{file,“emqx_connection.erl”},{line,367}]},
    {proc_lib,wake_up,3,
    [{file,“proc_lib.erl”},{line,236}]}]}]
    stale_pid: <0.9542.4999>
    2024-09-24T11:00:52.748906+08:00 [debug] 128.244.12.0:11900 [Auth http] Auth succeeded from path: /iotauth/api/v1/MdIotAuth, username: xxx
    2024-09-24T11:00:52.750396+08:00 [debug] 128.244.12.0:11900 [Channel] RECV CONNECT(Q0, R0, D0, ClientId=205173, ProtoName=MQTT, ProtoVsn=4, CleanStart=false, KeepAlive=120, Username=xxx, Password=
    )
    2024-09-24T11:00:52.750541+08:00 [debug] 128.244.12.0:11900 [MQTT] SEND CONNACK(Q0, R0, D0, AckFlags=0, ReasonCode=2)
    2024-09-24T11:00:52.750707+08:00 [info] 128.244.12.0:11900 msg: terminate
    reason: {shutdown,client_identifier_not_valid}
    2024-09-24T11:01:13.607569+08:00 [debug] 128.244.11.0:64816 [Auth http] Auth succeeded from path: /iotauth/api/v1/MdIotAuth, username: xxx
    2024-09-24T11:01:13.608280+08:00 [debug] 128.244.11.0:64816 [Channel] RECV CONNECT(Q0, R0, D0, ClientId=205173, ProtoName=MQTT, ProtoVsn=4, CleanStart=false, KeepAlive=120, Username=xxx, Password=******)
    2024-09-24T11:01:13.608505+08:00 [debug] 128.244.11.0:64816 [MQTT] SEND CONNACK(Q0, R0, D0, AckFlags=0, ReasonCode=2)
    2024-09-24T11:01:13.608637+08:00 [info] 128.244.11.0:64816 msg: terminate
    reason: {shutdown,client_identifier_not_valid}
    2024-09-24T11:01:34.662998+08:00 [debug] 128.244.11.0:41969 [Auth http] Auth succeeded from path: /iotauth/api/v1/MdIotAuth, username: xxx

原因是设备上报了非法的消息
2024-09-24T11:00:25.846144+08:00 [debug] 128.244.12.0:14421 [MQTT] RECV <<16,70,0,4,77,81,84,84,4,192,0,120,0,6,50,48,53,49,55,51,0,12,70,71,53,50,56,48,48,56,56,49,49,57,0,36,100,52,56,99,53,51,97,55,45,49,99,101,52,45,52,52,57,102,45,57,57,102,49,45,53,55,52,97,98,55,98,48,55,52,50,50>>

这个报文解析结果是一个 CONNECT 连接请求

这个protocol_error 上下文都在的日志么?

正常来说设备上报数据包后平台会立即响应ack,什么问题会导致数据包响应ack延期10秒?

例如,设备认证访问了外部资源(例如,HTTP 认证、MySQL 认证等)