-
Make sure that all the graph outputs are tagged with
-
@slaren Thanks for your help on this. Closing as resolved.
Backend Scheduling Issue (maybe)
This is an issue I've run into when trying to get a multi-modal vision model to work with the new Vision API.
My goal with this was to get some experience with the new Vision API and also with a multi-modal model that uses cross-attention, like Llama 3.2 Vision, so that I can hopefully contribute to this part of the project in the future.
To get something working I looked at Ollama's support for Llama 3.2 Vision Instruct and the models they provide. They ship two models, one for the language model and one for the vision encoder.
In our case I assumed that we only want a single model, so that is what I opted for.
I wanted to follow the new Vision API and the Llava example that was provided in #9687. So I used the same image to try to reproduce the same/similar output.
The Issue
While developing/debugging the model I added a number of tensors that are copies of tensors used in the computation graph, so that I could inspect their output even if the original tensor gets reused by the backend scheduler (which I believe it can do with tensors that are part of the graph).
So I added tensors like this:
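A rough sketch of the idea, assuming ggml-style graph building (the names ctx0, gf, cur, and add_debug_copy are placeholders, not the code from the example):

```cpp
#include "ggml.h"

// Sketch: duplicate an intermediate tensor and tag the copy as a graph output so
// that the backend scheduler does not reuse its buffer before it can be inspected.
static struct ggml_tensor * add_debug_copy(struct ggml_context * ctx0,
                                           struct ggml_cgraph  * gf,
                                           struct ggml_tensor  * cur,
                                           const char          * name) {
    struct ggml_tensor * copy = ggml_dup(ctx0, cur); // copy of the tensor to inspect
    ggml_set_name(copy, name);                       // so it can be looked up by name later
    ggml_set_output(copy);                           // mark as a graph output (GGML_TENSOR_FLAG_OUTPUT)
    ggml_build_forward_expand(gf, copy);             // make the copy part of the graph
    return copy;
}
```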
Now, running the example with this code in place produces a pretty reasonable output:
This might not be perfect but at least the image is described and the vision encoder produces something that the language model can also work with.
Now, if I comment out the code above, the output is different, something like:
"The image shows a gray background..."
I initially thought it was because the image patch embeddings were not being generated correctly, but I checked them (by uncommenting the code in encode_image_with_ca_vision), and the image patch embeddings are the same whether this code is commented out or not. So removing this tensor does not seem to affect the image patch embeddings (the vision encoder).
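A rough sketch of such a check, assuming F32 embeddings and a hypothetical embeddings tensor (not the code from the example):

```cpp
#include "ggml.h"
#include "ggml-backend.h"

#include <cstdio>
#include <vector>

// Sketch: copy the computed embeddings back to host memory and print the first
// few values, so that runs with and without the debug tensors can be compared.
static void dump_embeddings(const struct ggml_tensor * embeddings) {
    std::vector<float> data(ggml_nelements(embeddings));              // assumes GGML_TYPE_F32
    ggml_backend_tensor_get(embeddings, data.data(), 0, ggml_nbytes(embeddings));
    for (size_t i = 0; i < data.size() && i < 10; i++) {
        printf("embd[%zu] = %f\n", i, data[i]);
    }
}
```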
The graph is allocated here, and then computed here.
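For reference, the allocate/compute/reset cycle around those calls looks roughly like the following sketch (the run_graph helper is a placeholder, not code from the example):

```cpp
#include "ggml.h"
#include "ggml-backend.h"

// Sketch of the scheduler lifecycle: allocate the graph across the backends,
// compute it, read back any tensors tagged as outputs, then reset the scheduler
// before the next graph is built.
static enum ggml_status run_graph(ggml_backend_sched_t sched, struct ggml_cgraph * gf) {
    if (!ggml_backend_sched_alloc_graph(sched, gf)) {
        return GGML_STATUS_ALLOC_FAILED;                  // backend buffer allocation failed
    }
    enum ggml_status status = ggml_backend_sched_graph_compute(sched, gf);
    // ... read back tensors tagged as outputs here, before the next reset ...
    ggml_backend_sched_reset(sched);                      // clear allocations for the next graph
    return status;
}
```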
I also noticed that increasing the number of layers that I offload to the GPU affects the output as well. For example, if I change the number of layers from 30 to 36, I also see the output above with the "gray background".
It seems to me that if I make a change to the computation graph of the vision model, this can have an effect on the language model, which I was not expecting (not saying it is wrong, as I'm unfamiliar with the inner workings of the backend scheduler). It is almost as if the graphs are shared, but I was thinking they would not be after calling ggml_backend_sched_reset(ctx.sched).
Does anyone recognize this issue, or have any ideas where I should start looking to try to figure this out?
The example contains the steps to convert the model, quantize it, and also run it.