gcsfuse not working with custom container in Vertex AI on NVIDIA_L4 #1677
Hi @ronaldpanape, The "fuse device not found" error usually means the FUSE kernel module isn't loaded. Here's how to fix that:
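The original snippet did not survive in this thread. A plausible sketch of the usual fix (assuming you control how the container is launched; `IMAGE` is a placeholder):

```
# The FUSE device must be exposed to the container. When launching it
# yourself, pass the device and the capability gcsfuse needs:
#
#   docker run --device /dev/fuse --cap-add SYS_ADMIN IMAGE
#
# or, more broadly, run the container fully privileged:
#
#   docker run --privileged IMAGE
#
# Inside the container, verify the device exists before mounting:
[ -e /dev/fuse ] && echo "fuse device present" || echo "fuse device not found"
```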
Let me know if this resolves the issue. If not, please provide more details about your system and the command you were trying to run. Thanks.
Hi @Tulsishah, I am using Vertex AI, as seen below. How do I get that working in that environment?
Hi @ronaldpanape, I think you can add the following to your Dockerfile and test:
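The Dockerfile lines referenced here were lost in the scrape. A hedged sketch of the standard gcsfuse install steps, assuming a Debian bullseye base image (the release codename is an assumption, not from this thread):

```dockerfile
# Install fuse and gcsfuse from Google's apt repository
# (adjust "gcsfuse-bullseye" to match your base image's release).
RUN apt-get update && apt-get install -y curl gnupg && \
    echo "deb https://packages.cloud.google.com/apt gcsfuse-bullseye main" \
        > /etc/apt/sources.list.d/gcsfuse.list && \
    curl https://packages.cloud.google.com/apt/doc/apt-key.gpg | apt-key add - && \
    apt-get update && apt-get install -y fuse gcsfuse && \
    rm -rf /var/lib/apt/lists/*
```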
This issue seems like it might be related to Vertex AI itself: gcsfuse requires fuse to be installed first, and fuse is not working in this environment. If the above solution doesn't resolve the problem, I recommend raising an issue directly on the Vertex AI issue tracker for further investigation. Thanks.
@ronaldpanape this looks like an issue in the VertexAI training VM, so I checked with the VertexAI training team. Copying Vertex AI team's response from internal portal:
I'll try to find out if the Vertex AI team can directly respond to the issue somehow.
@ronaldpanape the Vertex AI team suggested that you use gcsfuse in Vertex AI training following the documentation here, and open a support ticket there if it does not work; the Vertex AI team can then support you actively.
This works for me.
@gargnitingoogle is there any way users can configure the GCSFuse cache options in Vertex AI, as stated here? I am having data-starvation issues in a multi-node distributed training setting in Vertex AI pipelines, i.e. low GPU utilization on some nodes. I wonder if GCSFuse can be installed in the container, the bucket mounted, and an entrypoint set with the file, stat, and type cache options before the training command runs in Vertex AI.
@miguelalba96 the file cache feature that you pointed to was added in gcsfuse v2.0.0, which is not yet used by the prebuilt images provided by Vertex AI (it will be in a future version soon). So if you are using managed Vertex AI, there is no way to do it as of today.
It depends on how gcsfuse is installed and configured in your setup. If you have control over those two parameters, then it might work out; that is not possible for Vertex AI managed jobs.
I don't think it is possible in a managed Vertex AI job; you might have to create your own custom container. When you install and configure GCSFuse yourself in a custom container, remember to install gcsfuse v2.0.0 (instructions), and note that only a privileged container is supported for making GCSFuse work (documented in troubleshooting).
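To make the custom-container route concrete, here is a minimal entrypoint sketch that writes a gcsfuse config file enabling the file cache and the stat/type metadata caches, mounts the bucket, then hands off to the training command. `YOUR_BUCKET`, the paths, and the cache sizes are placeholder assumptions, not values from this thread; the mount step itself still requires a privileged container and gcsfuse >= 2.0.0.

```shell
#!/bin/sh
# Hypothetical entrypoint for a custom Vertex AI training container.
set -eu

CONFIG=/tmp/gcsfuse-config.yaml
CACHE_DIR=/tmp/gcsfuse-cache
MOUNT_POINT=/tmp/gcs
mkdir -p "$CACHE_DIR" "$MOUNT_POINT"

# Write a gcsfuse v2 config file: file cache plus stat/type metadata caches.
cat > "$CONFIG" <<EOF
cache-dir: $CACHE_DIR
file-cache:
  max-size-mb: -1
metadata-cache:
  ttl-secs: 3600
  stat-cache-max-size-mb: 32
  type-cache-max-size-mb: 4
EOF

# Mount only if gcsfuse is actually installed (skipped e.g. on a dev machine).
if command -v gcsfuse >/dev/null 2>&1; then
  gcsfuse --config-file "$CONFIG" YOUR_BUCKET "$MOUNT_POINT"
fi

# Hand off to the training command passed as container args.
exec "$@"
```

With this in place, the Vertex AI job spec's container command/args become the training command, and the data under `$MOUNT_POINT` is served through the file cache.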
@miguelalba96 as you've opened a new issue #1830 for the same question, let's continue this discussion there.
System (please complete the following information):
Vertex AI on NVIDIA_L4
Dockerfile setup
Error: