Updated cluster specs #6

Draft · wants to merge 21 commits into main

Conversation

@EduardDurech

No description provided.

@martinjaggi (Member)

Thanks a lot! Tiny detail: where you linked to https://wiki.rcp.epfl.ch/home/CaaS/how-to-switch-between-rcp-caas-cluster-and-ic-caas-cluster, the sentence seems to be missing its second part.

Also, was someone able to check whether the tutorial also works on the newer rcp-prod cluster, which has a newer runAI version?
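For reference, the locally installed CLI version can be checked with the standard `runai version` subcommand (the cluster-side Run:ai controller version is a separate question):

```
$ runai version
```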

@EduardDurech (Author)

I'm using it on rcp-caas-prod, but I think there's something on RCP's end, as the job ends right after the container starts; I've contacted them.

```
$ runai config cluster rcp-caas-prod
$ runai login
> <authentication key>
$ python csub.py -n test05
> b'interactiveworkload.run.ai/test05 created\n'
> b''
> The following commands may come in handy: [...]
```

but after a few seconds the job fails:

```
$ runai describe job test05

Name: test05
Namespace: runai-mlo-$<GASPAR>
Type: Interactive
Status: Failed
Duration: 22s
GPUs: 1.00
Total Requested GPUs: 1.00
Allocated GPUs: 0.00
Allocated GPUs memory: 0
Running PODs: 0
Pending PODs: 0
Parallelism: 1
Completions: 1
Succeeded PODs: 0
Failed PODs: 1
Is Distributed Workload: false
Service URLs:
Command Line: N/A

Pods:
POD         STATUS  TYPE         AGE  NODE
test05-0-0  ERROR   INTERACTIVE  56s  gpu009.rcp.epfl.ch/10.92.20.10

Events:
SOURCE                                                     TYPE     AGE  MESSAGE
------                                                     ----     ---  -------
podgroup/pg-test05-0-1d3c22b7-b348-43c0-8610-c6c81b3e4d68  Normal   56s  [Pending] Job status is Pending
runaijob/test05                                            Normal   56s  [SuccessfulCreate] Created pod: test05-0-0
pod/test05-0-0                                             Normal   54s  [Scheduled] Successfully assigned pod runai-mlo-$<GASPAR>/test05-0-0 to node gpu009.rcp.epfl.ch at node-pool default
podgroup/pg-test05-0-1d3c22b7-b348-43c0-8610-c6c81b3e4d68  Normal   53s  [ContainerCreating] Job status changed from Pending to ContainerCreating
pod/test05-0-0                                             Normal   52s  [AddedInterface] Add eth0 [172.16.8.211/32] from cilium
pod/test05-0-0                                             Normal   52s  [Pulling] Pulling image "ic-registry.epfl.ch/mlo/mlo:v1"
pod/test05-0-0                                             Normal   37s  [Pulled] Successfully pulled image "ic-registry.epfl.ch/mlo/mlo:v1" in 15.01s (15.01s including waiting)
pod/test05-0-0                                             Normal   37s  [Created] Created container test05
podgroup/pg-test05-0-1d3c22b7-b348-43c0-8610-c6c81b3e4d68  Normal   35s  [Running] Job status changed from ContainerCreating to Running
pod/test05-0-0                                             Normal   35s  [Started] Started container test05
podgroup/pg-test05-0-1d3c22b7-b348-43c0-8610-c6c81b3e4d68  Normal   30s  [Failed] Job status changed from Running to Failed
runaijob/test05                                            Warning  30s  [BackoffLimitExceeded] RunaiJob has reached the specified backoff limit
```
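The BackoffLimitExceeded warning at the end means the container itself exited with an error and Run:ai gave up retrying; the scheduler-side events all look healthy. A sketch of how one might retrieve the container's actual output to find the exit reason, assuming the standard runai and kubectl CLIs and the job/pod names from the transcript above:

```
# Container stdout/stderr via the Run:ai CLI:
$ runai logs test05

# Or directly via kubectl, using the pod name from `describe`:
$ kubectl logs test05-0-0 -n runai-mlo-$<GASPAR>
$ kubectl describe pod test05-0-0 -n runai-mlo-$<GASPAR>  # shows the container exit code
```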

On rcp-caas-test there is this error:

```
$ runai config cluster rcp-caas-test
$ runai login
> <authentication key>
$ runai list jobs
> ERRO[0000] no matches for kind "DistributedWorkload" in version "run.ai/v2alpha1"

$ python csub.py -n test05
> b''
> b'Error from server (PVC mlo-scratch does not exist.): error when creating "/mnt/c/Users/$<USERNAME>/AppData/Local/Temp/tmp0d4uib8d.yaml": admission webhook "workload-controller.runai.svc" denied the request: PVC mlo-scratch does not exist.\n'
```
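Both errors point at cluster-side state rather than at the script: the first means the DistributedWorkload CRD (run.ai/v2alpha1) isn't registered on rcp-caas-test, and the second means the mlo-scratch PersistentVolumeClaim doesn't exist in the project namespace. A minimal sketch of how one could confirm both with plain kubectl, assuming the namespace pattern from above:

```
# Does this cluster know about the run.ai API group (DistributedWorkload kind)?
$ kubectl api-resources --api-group=run.ai

# Which PVCs actually exist in the Run:ai project namespace?
$ kubectl get pvc -n runai-mlo-$<GASPAR>
```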

I'll update once RCP support responds.

@haeggee (Collaborator) commented Jul 10, 2024

hey, thanks a lot for the PR and the updates! :)

I think it's a bit problematic right now, since the IC cluster and the new RCP-Prod have updated to a newer runai, whereas RCP-test has not yet. Did you make sure it works on all three clusters?

@haeggee self-assigned this Jul 10, 2024
@EduardDurech (Author)

It can't work on all three, but it can work on ic-caas and rcp-caas-prod.

@EduardDurech marked this pull request as draft on July 11, 2024, 14:15
@martinjaggi (Member)

If it works on the two newer clusters (newer runAI), then that would actually be enough. Thanks!
