[Bug]: Socket io fails in kubernetes cluster Name or service not known, can't connect to :9005 #4255

Open
sagojez opened this issue Nov 13, 2024 · 3 comments
Labels
bug Something isn't working

Comments


sagojez commented Nov 13, 2024

What happened
We're deploying the service with custom resource allocation. For reference, our Terraform files look something like this:

resource "kubernetes_namespace" "fluvio_sys" {
  metadata {
    name = "fluvio-sys"
  }
}

# This is a direct reference to a copy of https://github.com/infinyon/fluvio/tree/master/k8-util/helm
resource "helm_release" "fluvio_sys" {
  name       = "fluvio-sys"
  chart     = "../../../../../helm/charts/fluvio-sys"
  version    = "0.12.1"
  namespace  = kubernetes_namespace.fluvio_sys.metadata[0].name
}


# Fluvio development cluster

resource "kubernetes_namespace" "fluvio_development_group" {
  metadata {
    name = "fluvio_development_group"
  }
}

resource "helm_release" "fluvio" {
  name       = "fluvio"
  chart     = "../../../../../helm/charts/fluvio-app"
  version    = "0.12.1"
  namespace  = kubernetes_namespace.fluvio_development_group.metadata[0].name

  set {
    name = "service.type"
    value = "ClusterIP"
  }
}

resource "kubernetes_manifest" "fluvio_spugroup_main" {
  manifest = {
    apiVersion = "fluvio.infinyon.com/v1"
    kind = "SpuGroup"

    metadata = {
      name = "main"
      namespace = kubernetes_namespace.fluvio_development_group.metadata[0].name
    }

    spec = {
      replicas = 1
    }
  }
}

resource "kubernetes_manifest" "fluvio_topic_events" {
  ...
}

resource "kubernetes_manifest" "fluvio_topic_dlq" {
  ...
}

However, when running fluvio cluster spu list, the public address is malformed: the Public Endpoint column shows only the port, :10000, with no hostname.

When using fluvio cluster start, I can see a proper address with a proper port (see image below). My assumption is that something is missing; however, we don't see any option to set the hostname in the templates provided for the k8s deployment.
Image

Expected behavior
I would expect the Public Endpoint to be either configurable or at least properly set when customizing the resources via Terraform.

Describe the setup

  • Are you using a local Fluvio install? Minikube? Fluvio Cloud? GKE
  • What version of Fluvio are you using? fluvio version 0.12.1

Log output

SPG:

fluvio_sc::k8::controllers::spu_service: k8 config: ScK8Config {
    image: "infinyon/fluvio:0.12.1",
    pod_security_context: Some(
        PodSecurityContext {
            fs_group: None,
            run_as_group: None,
            run_as_non_root: None,
            run_as_user: None,
            sysctls: [],
        },
    ),
    lb_service_annotations: {},
    service: Some(
        ServiceSpec {
            cluster_ip: "",
            external_ips: [],
            load_balancer_ip: None,
            type: Some(
                NodePort,
            ),
            external_name: None,
            external_traffic_policy: None,
            ports: [],
            selector: None,
        },
    ),
    spu_pod_config: PodConfig {
        node_selector: {},
        resources: Some(
            ResourceRequirements {
                limits: Object {
                    "memory": String("1Gi"),
                },
                requests: Object {
                    "memory": String("256Mi"),
                },
            },
        ),
        storage_class: None,
        base_node_port: Some(
            30005,
        ),
        extra_containers: [],
        extra_env: [],
        extra_volume_mounts: [],
        extra_volumes: [],
    },
}
...

SC:


2024-11-13T19:27:48.050287Z ERROR MetadataDispatcher{spec="SpuService" namespace="development-fluvio"}:process_ws_action: fluvio_stream_dispatcher::dispatcher::metadata: error: SpuService, applying Failure (422):Service "fluvio-spu-main-0" is invalid: spec.ports[0].nodePort: Invalid value: 30004: provided port is already allocated.
2024-11-13T19:27:58.013944Z ERROR fluvio_sc::k8::controllers::spu_service: error with inner loop: Custom {
    kind: TimedOut,
    error: "store timed out: SpuService Apply: fluvio-spu-main-0 loop: 2, timer: 10000 ms",
}
...
Socket error: Socket io failed to lookup address information: Name or service not known, can't connect to :9005"
sagojez added the bug label Nov 13, 2024
sagojez (Author) commented Nov 15, 2024

In case anyone wants to work with this, here's a repository with the minimal set to reproduce the issue: https://github.com/sagoez/flv-scaffold/tree/main

sagojez (Author) commented Nov 28, 2024

After further investigation, we managed to identify the root cause of the issue. It appears that the following configuration is problematic:

  set {
    name = "service.type"
    value = "ClusterIP"
  }

That said, I believe the SC should be able to connect to the SPU without relying on a LoadBalancer. Is there a way to modify this behavior, or is it an intentional design choice?
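For anyone hitting the same error, here is a minimal sketch of the release from the original report with that override dropped. It assumes nothing beyond the resources already shown above, and simply lets the chart's default service type apply (the ScK8Config dump in the log output reports NodePort). Whether the default suits a given cluster is a deployment decision, so treat this as a starting point, not a confirmed fix.

```hcl
# Sketch only: same release as in the original report, minus the
# service.type override identified as problematic.
resource "helm_release" "fluvio" {
  name      = "fluvio"
  chart     = "../../../../../helm/charts/fluvio-app"
  version   = "0.12.1"
  namespace = kubernetes_namespace.fluvio_development_group.metadata[0].name

  # No `set { name = "service.type" ... }` block here, so the chart's
  # default service type is used instead of ClusterIP.
}
```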

sehz (Contributor) commented Nov 28, 2024

Can you link to which helm chart values you are overriding?

The SC and SPU use an internal network configuration to talk to each other, which is separate from the public network that clients connect to.

The SC never initiates communication with an SPU. It is the SPU that connects to the SC, using the private port, and the SC only accepts connections from SPUs in its registered list. Network configuration is out of scope for this repo: it is a deployment concern and differs for each deployment architecture (AWS, GCP, private data center). The deployment operator is expected to configure the network so that the SPU can reach the SC. The configuration in this repo is meant to work only in the most simplistic scenario, and that is the only one that will be supported.

You can test SC reachability from the SPU by running the ping command inside the SPU pod. There are also other tools that can help diagnose network configuration.

You can also reach out to [email protected] for commercial support.
