gcsfuse not working with custom container in Vertex AI on NVIDIA_L4 #1677

ronaldpanape · 2024-02-01T16:05:54Z

System (please complete the following information):

Vertex AI on NVIDIA_L4

Dockerfile setup

# Set up the GCSFuse repository
#RUN export GCSFUSE_REPO=gcsfuse-`lsb_release -c -s` && \
#    echo "deb https://packages.cloud.google.com/apt $GCSFUSE_REPO main" | tee /etc/apt/sources.list.d/gcsfuse.list && \
#    curl https://packages.cloud.google.com/apt/doc/apt-key.gpg | apt-key add -

# Install GCSFuse
#RUN apt-get update && \
#    apt-get install -y fuse gcsfuse

Error:

{"timestamp":{"seconds":1706802567,"nanos":766028008},"severity":"TRACE","message":"fuse_debug: Beginning the mounting kickoff process"}
{"timestamp":{"seconds":1706802567,"nanos":766076620},"severity":"TRACE","message":"fuse_debug: Parsing fuse file descriptor"}
{"timestamp":{"seconds":1706802567,"nanos":766087548},"severity":"TRACE","message":"fuse_debug: Preparing for direct mounting"}
{"timestamp":{"seconds":1706802567,"nanos":766127226},"severity":"TRACE","message":"fuse_debug: Directmount failed. Trying fallback."}
{"timestamp":{"seconds":1706802567,"nanos":766437137},"severity":"TRACE","message":"fuse_debug: Creating a socket pair"}
{"timestamp":{"seconds":1706802567,"nanos":766535546},"severity":"TRACE","message":"fuse_debug: Creating files to wrap the sockets"}
{"timestamp":{"seconds":1706802567,"nanos":766552608},"severity":"TRACE","message":"fuse_debug: Starting fusermount/os mount"}
/usr/bin/fusermount: fuse device not found, try 'modprobe fuse' first
{"timestamp":{"seconds":1706802567,"nanos":768124241},"severity":"ERROR","message":"Error while mounting gcsfuse: mountWithStorageHandle: Mount: mount: running /usr/bin/fusermount: exit status 1\n"}
mountWithArgs: mountWithStorageHandle: Mount: mount: running /usr/bin/fusermount: exit status 1

The text was updated successfully, but these errors were encountered:

Tulsishah · 2024-02-01T16:43:16Z

Hi @ronaldpanape,

The "fuse device not found" error usually means the FUSE kernel module isn't loaded. Here's how to fix that:

Install FUSE:
Ubuntu/Debian: sudo apt-get install fuse
CentOS/Fedora/RHEL: sudo yum install fuse
Load the module: sudo modprobe fuse
run container with --privilege mode.

Let me know if this resolves the issue. If not, please provide more details about your system and the command you were trying to run.

Thanks,
Tulsi Shah

ronaldpanape · 2024-02-01T19:19:24Z

Hi @Tulsishah

I am using Vertex AI, as seen below how do i get that working in that environment

    worker_pool_specs=[
        {
            "container_spec": {
                "image_uri": f"us-docker.pkg.dev/{project_id}/training/axolotl:v0.4.0",
                "command": ["/bin/bash", "-c"],
                "args": [full_command],
            },
            "replica_count": 1,
            "machine_spec": {
                "machine_type": machine_type,
                "accelerator_type": accelerator_type,
                "accelerator_count": accelerator_count,
            },
        }
    ],

Tulsishah · 2024-02-02T03:19:26Z

Hi @ronaldpanape,

I think you can add following in docker file and test

RUN apt-get install fuse && \
                modprobe fuse

This issue seems like it might be related to Vertex AI, as gcsfuse require to install fuse first and fuse itself is not working here. If the above solution doesn't resolve the problem, I recommend raising an issue directly on the Vertex AI issue tracker for further investigation.

Thanks,
Tulsi Shah

gargnitingoogle · 2024-02-06T14:31:27Z

@ronaldpanape this looks like an issue in the VertexAI training VM, so I checked with the VertexAI training team.

Copying Vertex AI team's response from internal portal:

Customers don't need to install GCSFuse themselves. As a managed training solution, Vertex AI Training installs and mounts GCSFuse for customers. Customers can directly read GCS in their containers. See this documentation for instructions: https://cloud.google.com/vertex-ai/docs/training/cloud-storage-file-system.

If the customers run into any further issues, please share with us the project number and job ID, our oncalls are more than happy to help.

I'll try to find out if the vertex AI team can directly respond to the issue somehow.

gargnitingoogle · 2024-02-07T05:26:51Z

@ronaldpanape the vertexAI team suggested that you use gcsfuse in vertexAI training using the documentation here and open a support ticket there, if it does not work, and then vertex AI team can actively support you.

ronaldpanape · 2024-02-07T18:01:11Z

this works for me

miguelalba96 · 2024-04-07T18:05:20Z

@gargnitingoogle is there any way users can configure in vertex AI the cache options from GCSFuse as stated here?, I am having data starvation issues on the multi-node distributed training setting in vertex AI pipelines, i.e Low GPU utilization in some nodes

I wonder if GCSFuse can be installed in the container, the bucket is mounted and set an entrypoint with the cache file, stat and type options before the training runs the command in vertex AI

gargnitingoogle · 2024-04-08T04:09:42Z

@miguelalba96 the file cache feature that you pointed to, was added in gcsfuse v2.0.0, which is not used by the prebuilt images provided by Vertex AI yet (but it will be in a future version soon). So, if you are using managed vertex AI, then there is no way to do it as of today.

is there any way users can configure in vertex AI the cache options from GCSFuse?

It depends on how gcsfuse is installed and configured in your setup. If you have control over those two parameters, then it might work out, which is not possible for vertex AI managed jobs.

I wonder if GCSFuse can be installed in the container ...

I don't think it is possible in a managed vertex AI job and you might have to create your own custom container. When you install and configure GCSFuse yourself in a custom container, then remember to install gcsfuse v2.0.0 (instructions) and remember that only a privileged container is supported for making GCSFuse work (documented in troubleshooting).

gargnitingoogle · 2024-04-08T08:41:58Z

@miguelalba96 as you've opened a new issue #1830 for the same question, let's continue this discussion there.

ronaldpanape added p1 P1 question Customer Issue: question about how to use tool labels Feb 1, 2024

ronaldpanape changed the title ~~using gcsfuse with Vertex AI on NVIDIA_L4~~ gcsfuse not working with custom container in Vertex AI on NVIDIA_L4 Feb 1, 2024

Tulsishah added pending customer action p2 P2 and removed p1 P1 labels Feb 4, 2024

github-actions bot removed the pending customer action label Feb 6, 2024

gargnitingoogle added the pending customer action label Feb 7, 2024

gargnitingoogle assigned ronaldpanape Feb 7, 2024

gargnitingoogle removed the question Customer Issue: question about how to use tool label Feb 7, 2024

ronaldpanape closed this as completed Feb 7, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

gcsfuse not working with custom container in Vertex AI on NVIDIA_L4 #1677

gcsfuse not working with custom container in Vertex AI on NVIDIA_L4 #1677

ronaldpanape commented Feb 1, 2024 •

edited

Loading

Tulsishah commented Feb 1, 2024 •

edited

Loading

ronaldpanape commented Feb 1, 2024

Tulsishah commented Feb 2, 2024

gargnitingoogle commented Feb 6, 2024 •

edited

Loading

gargnitingoogle commented Feb 7, 2024

ronaldpanape commented Feb 7, 2024

miguelalba96 commented Apr 7, 2024 •

edited

Loading

gargnitingoogle commented Apr 8, 2024 •

edited

Loading

gargnitingoogle commented Apr 8, 2024

gcsfuse not working with custom container in Vertex AI on NVIDIA_L4 #1677

gcsfuse not working with custom container in Vertex AI on NVIDIA_L4 #1677

Comments

ronaldpanape commented Feb 1, 2024 • edited Loading

Tulsishah commented Feb 1, 2024 • edited Loading

ronaldpanape commented Feb 1, 2024

Tulsishah commented Feb 2, 2024

gargnitingoogle commented Feb 6, 2024 • edited Loading

gargnitingoogle commented Feb 7, 2024

ronaldpanape commented Feb 7, 2024

miguelalba96 commented Apr 7, 2024 • edited Loading

gargnitingoogle commented Apr 8, 2024 • edited Loading

gargnitingoogle commented Apr 8, 2024

ronaldpanape commented Feb 1, 2024 •

edited

Loading

Tulsishah commented Feb 1, 2024 •

edited

Loading

gargnitingoogle commented Feb 6, 2024 •

edited

Loading

miguelalba96 commented Apr 7, 2024 •

edited

Loading

gargnitingoogle commented Apr 8, 2024 •

edited

Loading