EMQX cluster creation fails on Kubernetes

Environment

  • EMQX version: 5.0.8
  • OS and version:
  • Other

Problem description

Both the operator and the EMQX cluster show the following problem.

Configuration and logs

    node {
      cookie = emqxsecretcookie
      data_dir = "data"
      etc_dir = "etc"
    }
    cluster {
      discovery_strategy = dns
      dns {
        record_type = srv
        name:"emqx-headless.apulis.svc.cluster.local"
      }
    }
    dashboard {
      listeners.http {
          bind: 18083
      }
      default_username: "admin"
      default_password: "public"
    }
    listeners.tcp.default {
      bind = "0.0.0.0:1883"
      max_connections = 1024000
    }
    sysmon.vm.long_schedule = disabled
2022-10-21T07:36:09.686002+00:00 [info] Ekka(AutoCluster): joining with 'emqx@emqx-2.emqx-headless.apulis.svc.cluster.local'
2022-10-21T07:36:09.686807+00:00 [info] event=client_process_not_found target="'emqx@emqx-2.emqx-headless.apulis.svc.cluster.local'" action=spawning_client
2022-10-21T07:36:09.686969+00:00 [debug] line: 61, mfa: gen_rpc_dispatcher:handle_call/3, msg: gen_rpc_start_client, target: 'emqx@emqx-2.emqx-headless.apulis.svc.cluster.local'
2022-10-21T07:36:09.687287+00:00 [debug] line: 38, mfa: gen_rpc_client_sup:start_child/1, msg: gen_rpc_starting_new_client, target: 'emqx@emqx-2.emqx-headless.apulis.svc.cluster.local'
2022-10-21T07:36:09.687771+00:00 [info] event=initializing_client driver=tcp node="emqx@emqx-2.emqx-headless.apulis.svc.cluster.local" port=5369
2022-10-21T07:36:09.689509+00:00 [debug] event=connect_to_remote_server peer="emqx@emqx-2.emqx-headless.apulis.svc.cluster.local" socket="#Port<0.90>" result=success
2022-10-21T07:36:09.689852+00:00 [debug] event=authentication_connection_succeeded socket="#Port<0.90>"
2022-10-21T07:36:09.690443+00:00 [error] event=authentication_reception_failed socket="#Port<0.90>" reason="closed"
2022-10-21T07:36:09.690587+00:00 [error] event=client_authentication_failed driver=tcp reason="{badtcp,closed}"
2022-10-21T07:36:09.690938+00:00 [error] Ekka(AutoCluster): Discover error: {case_clause,{badtcp,closed}}, [{mria,do_join,2,[{file,"mria.erl"},{line,354}]},{ekka_autocluster,discover_and_join,2,[{file,"ekka_autocluster.erl"},{line,161}]},{ekka_autocluster,'-discover_and_join/0-fun-0-',2,[{file,"ekka_autocluster.erl"},{line,109}]},{ekka_autocluster,'-run/1-fun-0-',1,[{file,"ekka_autocluster.erl"},{line,54}]}]
2022-10-21T07:36:09.690886+00:00 [error] crasher: initial call: gen_rpc_client:init/1, pid: <0.2854.0>, registered_name: [], exit: {{badtcp,closed},[{gen_server,init_it,6,[{file,"gen_server.erl"},{line,407}]},{proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,226}]}]}, ancestors: [gen_rpc_client_sup,gen_rpc_sup,<0.1806.0>], message_queue_len: 0, messages: [], links: [<0.1812.0>], dictionary: [], trap_exit: true, status: running, heap_size: 1598, stack_size: 29, reductions: 29080; neighbours:
2022-10-21T07:36:09.692592+00:00 [warning] Ekka(AutoCluster): discovery did not succeed; retrying in 5000 ms

Check what DNS returns:

nslookup -type=srv emqx-headless.apulis.svc.cluster.local

Also:

  1. Could you post the complete YAML used to create EMQX?
  2. Which version of the operator are you using?

emqx yaml:

      mountPath: /tmp/fake
  livenessProbe:
    httpGet:
      path: /status
      port: 18083
    initialDelaySeconds: 60
    periodSeconds: 30
    failureThreshold: 10
  readinessProbe:
    httpGet:
      path: /status
      port: 18083
    initialDelaySeconds: 10
    periodSeconds: 5
    failureThreshold: 30
  startupProbe:
    httpGet:
      path: /status
      port: 18083
    initialDelaySeconds: 10
    periodSeconds: 5
    failureThreshold: 30
  lifecycle:
    preStop:
      exec:
        command: [ "/bin/sh","-c","emqx ctl cluster leave" ]
  # extraContainers:
  #   - name: extra
  #     image: busybox:stable
  #     command:
  #       - /bin/sh
  #       - -c
  #       - |
  #         tail -f /dev/null
  # initContainers:
  #   - name: busybox
  #     image: busybox

dashboardServiceTemplate:
  metadata:
    name: emqx-dashboard
  spec:
    selector:
      apps.emqx.io/db-role: core
    ports:
      - name: "dashboard-listeners-http-bind"
        protocol: TCP
        port: 18083
        targetPort: 18083
listenersServiceTemplate:
  metadata:
    name: emqx-listeners
  spec:
    ports:
      - name: mqtt
        protocol: TCP
        port: 1883
        targetPort: mqtt


operator: 2.0.1

That doesn't look like it was copied in full.

apiVersion: apps.emqx.io/v2alpha1
kind: EMQX
metadata:
  name: emqx
spec:
  image: "emqx/emqx:5.0.8"
  imagePullPolicy: IfNotPresent
  # imagePullSecrets: [fake-secrets]
  bootstrapConfig: |
    node {
      cookie = emqxsecretcookie
      data_dir = "data"
      etc_dir = "etc"
    }
    cluster {
      discovery_strategy = dns
      dns {
        record_type = srv
        name:"emqx-headless.apulis.svc.cluster.local"
      }
    }
    dashboard {
      listeners.http {
          bind: 18083
      }
      default_username: "admin"
      default_password: "public"
    }
    listeners.tcp.default {
      bind = "0.0.0.0:1883"
      max_connections = 1024000
    }
    sysmon.vm.long_schedule = disabled
  coreTemplate:
    metadata:
      name: emqx
      labels:
        apps.emqx.io/instance: emqx
        apps.emqx.io/db-role: core
      annotations:
    spec:
      replicas: 3
      nodeName:
      # nodeSelector:
      # affinity:
      # tolerations:
      command:
        - "/usr/bin/docker-entrypoint.sh"
      args:
        - "/opt/emqx/bin/emqx"
        - "foreground"
      ports:
        - name: mqtt
          containerPort: 1883
      env:
        - name: Foo
          value: Bar
      # envFrom:
      #   - configMapRef:
      #       name: fake-configmap
      resources:
        requests:
          memory: "64Mi"
          cpu: "125m"
        limits:
          memory: "1024Mi"
          cpu: "500m"
      podSecurityContext:
        runAsUser: 1000
        runAsGroup: 1000
        fsGroup: 1000
        fsGroupChangePolicy: Always
      containerSecurityContext:
        runAsUser: 1000
        runAsGroup: 1000
      extraVolumes:
        - name: fake-volume
          emptyDir: { }
      extraVolumeMounts:
        - name: fake-volume
          mountPath: /tmp/fake
      livenessProbe:
        httpGet:
          path: /status
          port: 18083
        initialDelaySeconds: 60
        periodSeconds: 30
        failureThreshold: 3
      readinessProbe:
        httpGet:
          path: /status
          port: 18083
        initialDelaySeconds: 10
        periodSeconds: 5
        failureThreshold: 12
      lifecycle:
        preStop:
          exec:
            command: [ "/bin/sh","-c","emqx ctl cluster leave" ]
      # extraContainers:
      #   - name: extra
      #     image: busybox:stable
      #     command:
      #       - /bin/sh
      #       - -c
      #       - |
      #         tail -f /dev/null
      # initContainers:
      #   - name: busybox
      #     image: busybox:stable
      #     securityContext:
      #       runAsUser: 0
      #       runAsGroup: 0
      #       capabilities:
      #         add:
      #         - SYS_ADMIN
      #         drop:
      #         - ALL
      #     command:
      #       - /bin/sh
      #       - -c
      #       - |
      #         mount -o remount rw /proc/sys
      #         sysctl -w net.core.somaxconn=65535
      #         sysctl -w net.ipv4.ip_local_port_range="1024 65535"
      #         sysctl -w kernel.core_uses_pid=0
      #         sysctl -w net.ipv4.tcp_tw_reuse=1
      #         sysctl -w fs.nr_open=1000000000
      #         sysctl -w fs.file-max=1000000000
      #         sysctl -w net.ipv4.ip_local_port_range='1025 65534'
      #         sysctl -w net.ipv4.udp_mem='74583000 499445000 749166000'
      #         sysctl -w net.ipv4.tcp_max_syn_backlog=163840
      #         sysctl -w net.core.netdev_max_backlog=163840
      #         sysctl -w net.core.optmem_max=16777216
      #         sysctl -w net.ipv4.tcp_rmem='1024 4096 16777216'
      #         sysctl -w net.ipv4.tcp_wmem='1024 4096 16777216'
      #         sysctl -w net.ipv4.tcp_max_tw_buckets=1048576
      #         sysctl -w net.ipv4.tcp_fin_timeout=15
      #         sysctl -w net.core.rmem_default=262144000
      #         sysctl -w net.core.wmem_default=262144000
      #         sysctl -w net.core.rmem_max=262144000
      #         sysctl -w net.core.wmem_max=262144000
      #         sysctl -w net.ipv4.tcp_mem='378150000  504200000  756300000'
      #         sysctl -w net.netfilter.nf_conntrack_max=1000000
      #         sysctl -w net.netfilter.nf_conntrack_tcp_timeout_time_wait=30
  replicantTemplate:
    metadata:
      name: emqx-replicant
      labels:
        apps.emqx.io/instance: emqx
        apps.emqx.io/db-role: replicant
    spec:
      replicas: 1
      # nodeName:
      # nodeSelector:
      # affinity:
      # tolerations:
      command:
        - "/usr/bin/docker-entrypoint.sh"
      args:
        - "/opt/emqx/bin/emqx"
        - "foreground"
      ports:
        - name: mqtt
          containerPort: 1883
      env:
        - name: Foo
          value: Bar
      # envFrom:
      #   - configMapRef:
      #       name: fake-configmap
      # resources:
      #   requests:
      #     memory: "250Mi"
      #     cpu: "250m"
      #   limits:
      #     memory: "1024Mi"
      #     cpu: "500m"
      podSecurityContext:
        runAsUser: 1000
        runAsGroup: 1000
        fsGroup: 1000
        fsGroupChangePolicy: Always
        supplementalGroups:
          - 1000
      containerSecurityContext:
        runAsNonRoot: true
        runAsUser: 1000
        runAsGroup: 1000
      extraVolumes:
        - name: fake-volume
          emptyDir: { }
      extraVolumeMounts:
        - name: fake-volume
          mountPath: /tmp/fake
      livenessProbe:
        httpGet:
          path: /status
          port: 18083
        initialDelaySeconds: 60
        periodSeconds: 30
        failureThreshold: 10
      readinessProbe:
        httpGet:
          path: /status
          port: 18083
        initialDelaySeconds: 10
        periodSeconds: 5
        failureThreshold: 30
      startupProbe:
        httpGet:
          path: /status
          port: 18083
        initialDelaySeconds: 10
        periodSeconds: 5
        failureThreshold: 30
      lifecycle:
        preStop:
          exec:
            command: [ "/bin/sh","-c","emqx ctl cluster leave" ]
      # extraContainers:
      #   - name: extra
      #     image: busybox:stable
      #     command:
      #       - /bin/sh
      #       - -c
      #       - |
      #         tail -f /dev/null
      # initContainers:
      #   - name: busybox
      #     image: busybox
  dashboardServiceTemplate:
    metadata:
      name: emqx-dashboard
    spec:
      selector:
        apps.emqx.io/db-role: core
      ports:
        - name: "dashboard-listeners-http-bind"
          protocol: TCP
          port: 18083
          targetPort: 18083
  listenersServiceTemplate:
    metadata:
      name: emqx-listeners
    spec:
      ports:
        - name: mqtt
          protocol: TCP
          port: 1883
          targetPort: mqtt

Change the name here back to emqx-core.



The operator controller reported an error.

Wait a little while, then apply test.yaml again.


emqx-core-0:

2022-10-24T02:38:53.987259+00:00 [debug] line: 141, mfa: emqx_retainer_mnesia:store_retained/2, msg: message_retained, topic: $SYS/brokers/emqx@emqx-core-0.emqx-headless.apulis.svc.cluster.local/version
2022-10-24T02:38:53.987819+00:00 [debug] line: 141, mfa: emqx_retainer_mnesia:store_retained/2, msg: message_retained, topic: $SYS/brokers/emqx@emqx-core-0.emqx-headless.apulis.svc.cluster.local/sysdescr
2022-10-24T02:38:53.988392+00:00 [debug] line: 141, mfa: emqx_retainer_mnesia:store_retained/2, msg: message_retained, topic: $SYS/brokers
2022-10-24T02:38:57.402784+00:00 [info] Ekka(AutoCluster): joining with 'emqx@emqx-core-0.emqx-headless.apulis.svc.cluster.local'
2022-10-24T02:38:57.402943+00:00 [debug] Ekka(AutoCluster): join result: ignore
2022-10-24T02:38:57.403074+00:00 [info] Ekka(AutoCluster): no discovered nodes outside cluster
2022-10-24T02:38:57.404449+00:00 [warning] Ekka(AutoCluster): discovery did not succeed; retrying in 5000 ms
2022-10-24T02:39:05.304673+00:00 [info] Ekka(AutoCluster): joining with 'emqx@emqx-core-0.emqx-headless.apulis.svc.cluster.local'
2022-10-24T02:39:05.304776+00:00 [debug] Ekka(AutoCluster): join result: ignore
2022-10-24T02:39:05.304875+00:00 [info] Ekka(AutoCluster): no discovered nodes outside cluster
2022-10-24T02:39:05.306059+00:00 [warning] Ekka(AutoCluster): discovery did not succeed; retrying in 5000 ms

emqx-core-1:

2022-10-24T02:40:03.769044+00:00 [info] Ekka(AutoCluster): joining with 'emqx@emqx-core-0.emqx-headless.apulis.svc.cluster.local'
2022-10-24T02:40:03.769421+00:00 [info] event=client_process_not_found target="'emqx@emqx-core-0.emqx-headless.apulis.svc.cluster.local'" action=spawning_client
2022-10-24T02:40:03.769506+00:00 [debug] line: 61, mfa: gen_rpc_dispatcher:handle_call/3, msg: gen_rpc_start_client, target: 'emqx@emqx-core-0.emqx-headless.apulis.svc.cluster.local'
2022-10-24T02:40:03.769711+00:00 [debug] line: 38, mfa: gen_rpc_client_sup:start_child/1, msg: gen_rpc_starting_new_client, target: 'emqx@emqx-core-0.emqx-headless.apulis.svc.cluster.local'
2022-10-24T02:40:03.769908+00:00 [info] event=initializing_client driver=tcp node="emqx@emqx-core-0.emqx-headless.apulis.svc.cluster.local" port=5369
2022-10-24T02:40:03.770789+00:00 [debug] event=connect_to_remote_server peer="emqx@emqx-core-0.emqx-headless.apulis.svc.cluster.local" socket="#Port<0.307>" result=success
2022-10-24T02:40:03.770936+00:00 [debug] event=authentication_connection_succeeded socket="#Port<0.307>"
2022-10-24T02:40:03.771210+00:00 [error] event=authentication_reception_failed socket="#Port<0.307>" reason="closed"
2022-10-24T02:40:03.771289+00:00 [error] event=client_authentication_failed driver=tcp reason="{badtcp,closed}"
2022-10-24T02:40:03.771405+00:00 [error] crasher: initial call: gen_rpc_client:init/1, pid: <0.3666.0>, registered_name: [], exit: {{badtcp,closed},[{gen_server,init_it,6,[{file,"gen_server.erl"},{line,407}]},{proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,226}]}]}, ancestors: [gen_rpc_client_sup,gen_rpc_sup,<0.1806.0>], message_queue_len: 0, messages: [], links: [<0.1812.0>], dictionary: [], trap_exit: true, status: running, heap_size: 1598, stack_size: 29, reductions: 29124; neighbours:
2022-10-24T02:40:03.771630+00:00 [error] Ekka(AutoCluster): Discover error: {case_clause,{badtcp,closed}}, [{mria,do_join,2,[{file,"mria.erl"},{line,354}]},{ekka_autocluster,discover_and_join,2,[{file,"ekka_autocluster.erl"},{line,161}]},{ekka_autocluster,'-discover_and_join/0-fun-0-',2,[{file,"ekka_autocluster.erl"},{line,109}]},{ekka_autocluster,'-run/1-fun-0-',1,[{file,"ekka_autocluster.erl"},{line,54}]}]
2022-10-24T02:40:03.772387+00:00 [warning] Ekka(AutoCluster): discovery did not succeed; retrying in 5000 ms

kubectl get pods -n apulis -o wide

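Let me see how many pods have come up.

After I changed the name to emqx-core, it worked.

emqx-core still doesn't work on my side.

I switched to deploying the EMQX cluster directly. I added port 5369 to containerPort in the StatefulSet and also added port 5369 to emqx-headless, and now the cluster is created normally.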
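
For anyone who lands here with the same `{badtcp,closed}` error on port 5369, the change described above might look roughly like the fragments below. This is only a sketch, not the poster's actual manifest: the Service name `emqx-headless`, the namespace `apulis`, the image, and the port 5369 come from this thread, while the StatefulSet name `emqx`, the `app: emqx` label, and the port name `gen-rpc` are assumptions made for illustration. Other ports a real deployment needs are left out.

```yaml
# Sketch only: shows where port 5369 is added, not a complete deployment.
# Headless Service used for DNS (SRV) discovery.
apiVersion: v1
kind: Service
metadata:
  name: emqx-headless        # from the thread
  namespace: apulis          # from the thread
spec:
  clusterIP: None            # headless Service
  selector:
    app: emqx                # assumed label; use the label your pods carry
  ports:
    - name: mqtt
      port: 1883
      targetPort: 1883
    - name: gen-rpc          # hypothetical port name
      port: 5369             # port the logs show gen_rpc connecting to
      targetPort: 5369
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: emqx                 # assumed name
  namespace: apulis
spec:
  serviceName: emqx-headless
  replicas: 3
  selector:
    matchLabels:
      app: emqx
  template:
    metadata:
      labels:
        app: emqx
    spec:
      containers:
        - name: emqx
          image: emqx/emqx:5.0.8
          ports:
            - name: mqtt
              containerPort: 1883
            - name: gen-rpc          # hypothetical port name
              containerPort: 5369    # added port, as described above
```

After applying something like this, `kubectl get pods -n apulis -o wide` and the `nslookup -type=srv` check from earlier in the thread can be used to confirm that the pods are up and that the headless Service resolves.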