[Bug]: Socket io fails in kubernetes cluster `Name or service not known, can't connect to :9005` #4255

sagojez · 2024-11-13T20:47:45Z

What happened
We're deploying the service with custom resource allocation. For reference, our Terraform files look something like this:

resource "kubernetes_namespace" "fluvio_sys" {
  metadata {
    name = "fluvio-sys"
  }
}

# This is a direct reference to a copy of https://github.com/infinyon/fluvio/tree/master/k8-util/helm
resource "helm_release" "fluvio_sys" {
  name       = "fluvio-sys"
  chart      = "../../../../../helm/charts/fluvio-sys"
  version    = "0.12.1"
  namespace  = kubernetes_namespace.fluvio_sys.metadata[0].name
}


# Fluvio development cluster

resource "kubernetes_namespace" "fluvio_development_group" {
  metadata {
    name = "fluvio_development_group"
  }
}

resource "helm_release" "fluvio" {
  name       = "fluvio"
  chart      = "../../../../../helm/charts/fluvio-app"
  version    = "0.12.1"
  namespace  = kubernetes_namespace.fluvio_development_group.metadata[0].name

  set {
    name = "service.type"
    value = "ClusterIP"
  }
}

resource "kubernetes_manifest" "fluvio_spugroup_main" {
  manifest = {
    apiVersion = "fluvio.infinyon.com/v1"
    kind = "SpuGroup"

    metadata = {
      name = "main"
      namespace = kubernetes_namespace.fluvio_development_group.metadata[0].name
    }

    spec = {
      replicas = 1
    }
  }
}

resource "kubernetes_manifest" "fluvio_topic_events" {
  ...
}

resource "kubernetes_manifest" "fluvio_topic_dlq" {
  ...
}

However, when running `fluvio cluster spu list`, I find that the public address is malformed: the Public Endpoint column shows only the port, `:10000`, with no hostname.

When using `fluvio cluster start`, I can see a proper address with a proper port (see image below). My assumption is that something is missing; however, we don't see any option to set the hostname in the templates provided for the k8s deployment.

Expected behavior
I would expect the Public Endpoint to be either configurable or at least set properly when customizing the resources via Terraform.

Describe the setup

Are you using a local Fluvio install? Minikube? Fluvio Cloud? GKE
What version of Fluvio are you using? fluvio version 0.12.1

Log output

SPG:

fluvio_sc::k8::controllers::spu_service: k8 config: ScK8Config {
    image: "infinyon/fluvio:0.12.1",
    pod_security_context: Some(
        PodSecurityContext {
            fs_group: None,
            run_as_group: None,
            run_as_non_root: None,
            run_as_user: None,
            sysctls: [],
        },
    ),
    lb_service_annotations: {},
    service: Some(
        ServiceSpec {
            cluster_ip: "",
            external_ips: [],
            load_balancer_ip: None,
            type: Some(
                NodePort,
            ),
            external_name: None,
            external_traffic_policy: None,
            ports: [],
            selector: None,
        },
    ),
    spu_pod_config: PodConfig {
        node_selector: {},
        resources: Some(
            ResourceRequirements {
                limits: Object {
                    "memory": String("1Gi"),
                },
                requests: Object {
                    "memory": String("256Mi"),
                },
            },
        ),
        storage_class: None,
        base_node_port: Some(
            30005,
        ),
        extra_containers: [],
        extra_env: [],
        extra_volume_mounts: [],
        extra_volumes: [],
    },
}
...

SC:


2024-11-13T19:27:48.050287Z ERROR MetadataDispatcher{spec="SpuService" namespace="development-fluvio"}:process_ws_action: fluvio_stream_dispatcher::dispatcher::metadata: error: SpuService, applying Failure (422):Service "fluvio-spu-main-0" is invalid: spec.ports[0].nodePort: Invalid value: 30004: provided port is already allocated.
2024-11-13T19:27:58.013944Z ERROR fluvio_sc::k8::controllers::spu_service: error with inner loop: Custom {
    kind: TimedOut,
    error: "store timed out: SpuService Apply: fluvio-spu-main-0 loop: 2, timer: 10000 ms",
}
...
Socket error: Socket io failed to lookup address information: Name or service not known, can't connect to :9005"

sagojez · 2024-11-15T20:39:10Z

In case anyone wants to work on this, here's a repository with the minimal setup needed to reproduce the issue: https://github.com/sagoez/flv-scaffold/tree/main

sagojez · 2024-11-28T18:47:15Z

After further investigation, we managed to identify the root cause of the issue. It appears that the following configuration is problematic:

  set {
    name = "service.type"
    value = "ClusterIP"
  }

That said, I believe the SC should be able to connect to the SPU without relying on a LoadBalancer. Is there a way to modify this behavior, or is it an intentional design choice?
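
For comparison, here is a minimal sketch of the same release with the override removed, so that `service.type` falls back to whatever default the chart ships. This is an assumption about a possible workaround, not a confirmed fix: it only reflects the observation above that the `ClusterIP` override triggers the problem, and the chart default may still need adjusting for a given cluster.

```hcl
# Sketch only: the helm_release from the report, minus the service.type override.
resource "helm_release" "fluvio" {
  name      = "fluvio"
  chart     = "../../../../../helm/charts/fluvio-app"
  version   = "0.12.1"
  namespace = kubernetes_namespace.fluvio_development_group.metadata[0].name

  # No `set { name = "service.type" ... }` block here, so the chart's own
  # default service type applies. Whether that default suits a particular
  # cluster is an assumption to verify with `fluvio cluster spu list`.
}
```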

sehz · 2024-11-28T20:06:58Z

Can you link to the Helm chart values you are overriding?

The SC and SPU use an internal network configuration to talk to each other, which is different from the public network that clients connect to.

The SC never initiates communication with the SPU. It is the SPU that initiates the connection to the SC, using the private port. The SC only accepts connections from SPUs in its registered list. Network configuration is out of scope for this repo, since it is a deployment concern and will differ for each deployment architecture (AWS, GCP, private data center). It is assumed that the deployment operator will configure the network so that the SPU can reach the SC. The configuration in this repo is meant to work only in the most simplistic scenario, and that is the only one that will be supported.

You can test SC reachability from the SPU by running the ping command within the SPU pod. There are other tools that can help diagnose network configuration.

You can also reach out to [email protected] for commercial support.

sagojez added the bug label on Nov 13, 2024
