Issue connecting to nodes that are not within the same cluster #658

Open
yxusnapchat opened this issue Oct 11, 2024 · 2 comments

@yxusnapchat

Hi team, I have an example based on the latest NVIDIA image, nvcr.io/nvidia/tensorflow:24.07-tf2-py3, and I run the MPI job with the launcher and workers on different nodes. However, the launcher complains that it cannot identify the worker. Is it supported to have the launcher and workers running on separate nodes?

kind: MPIJob
metadata:
  name: xxx
spec:
  slotsPerWorker: 1
  runPolicy:
    cleanPodPolicy: Running
  mpiReplicaSpecs:
    Launcher:
      replicas: 1
      template:
        spec:
          containers:
          - image: xxx
            # env:
            #   - name: TF_USE_LEGACY_KERAS
            #     value: "1" 
            # resources:
            #   limits:
            #     nvidia.com/gpu: 1  # Request 1 GPU
            #   requests:
            #     nvidia.com/gpu: 1  # Optionally set requests equal to limits
            name: mpi-launcher
            command:
            - mpirun
            args:
            - -np
            - "2"
            - --allow-run-as-root
            - -bind-to
            - none
            - -map-by
            - slot
            - -x
            - LD_LIBRARY_PATH
            - -x
            - PATH
            - -mca
            - pml
            - ob1
            - -mca
            - btl
            - ^openib
            - python
            - /nvidia-examples/movielens-1m-keras-with-horovod.py
            - --mode=train
            # No shell parses these args, so inner quotes would be passed literally.
            - --model_dir=./model_dir
            - --export_dir=./export_dir
    Worker:
      replicas: 2
      template:
        spec:
          containers:
          - image: xxx
            name: mpi-worker
            # env:
            #   - name: TF_USE_LEGACY_KERAS
            #     value: "1" 
            resources:
              limits:
                nvidia.com/gpu: 1  # Request 1 GPU
              requests:
                nvidia.com/gpu: 1  # Optionally set requests equal to limits

Also, I am curious where the code that starts the worker lives. A pointer would be appreciated. Thanks!

@alculquicondor
Collaborator

Yes, it is supported.

You can find examples here: https://github.com/kubeflow/mpi-operator/tree/master/examples/v2beta1

@alculquicondor
Collaborator

Please share more details about the error messages in both the launcher and worker pods.
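
One possible way to gather the requested details is sketched below with standard kubectl commands. The namespace and pod names are placeholders (assumptions, not taken from the issue); substitute whatever `kubectl get pods` reports for the job.

```shell
# Placeholder values (assumptions): replace with the names from your cluster.
NAMESPACE=default
LAUNCHER_POD=xxx-launcher-abcde   # hypothetical launcher pod name
WORKER_POD=xxx-worker-0           # hypothetical worker pod name

# List the pods with their status and the node each one was scheduled on.
kubectl get pods -n "$NAMESPACE" -o wide

# Capture the launcher output, where the mpirun error should appear.
kubectl logs -n "$NAMESPACE" "$LAUNCHER_POD"

# Capture the worker output to see whether the worker started at all.
kubectl logs -n "$NAMESPACE" "$WORKER_POD"

# Show scheduling and container events for the worker (image pulls, GPU allocation, restarts).
kubectl describe pod -n "$NAMESPACE" "$WORKER_POD"
```

The `-o wide` output also confirms whether the launcher and workers really landed on different nodes, which is the scenario the question is about.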
