Unable to schedule two output buffers to be computed simultaneously when they are dependent. #6496

mcourteaux · 2021-10-29T14:43:46Z

mcourteaux
Oct 29, 2021
Collaborator

I'm trying to boil my problem down to a minimal example. Here is my generator:

class GeneratorBug : public Generator<GeneratorBug> {
   public:
    Input<Buffer<float>> input1{"input1", 2};
    Output<Buffer<float>> output1{"output1", 1};
    Output<Buffer<float>> output2{"output2", 1};

    Var k{"kernel"};

    Func intermediate{"intermediate"};

    void generate() {
        intermediate(k) = input1(k, 0) * input1(k, 1);
        output1(k) = intermediate(k) * intermediate(k);
        output2(k) = Halide::sqrt(output1(k) + output1(k));

        Expr num = input1.dim(0).extent();
        input1.dim(0).set_bounds(0, num);
        input1.dim(1).set_bounds(0, 2);
        output1.dim(0).set_bounds(0, num);
        output2.dim(0).set_bounds(0, num);
    }

    void schedule() {
        // clang-format off
        intermediate
            .vectorize(k, 8)
            .compute_at(output1, k)
            ;

        output1
            //.compute_root()
            .vectorize(k, 8)
            .compute_at(output2, k)
            //.compute_with(output2, k, LoopAlignStrategy::Auto)
            ;

        output2
            .compute_root()
            .vectorize(k, 8)
            //.compute_with(output1, k)
            ;
        // clang-format on
    }
};

HALIDE_REGISTER_GENERATOR(GeneratorBug, generator_bug)

Clearly, output2 depends on output1. I want everything to be processed in one for loop, with vectors of size 8. However, the statement file looks like this:

Clearly the two outputs are produced separately. Uncommenting either of the compute_with directives causes an error when running the generator:

Unhandled exception: Error: Invalid compute_with: there is dependency between output1 and output2

So far, I haven't been able to come up with a workaround. I think output1.compute_at(output2, k) doesn't work because output1 is an output buffer, which is implicitly scheduled like .compute_root(), if I understand correctly. And .compute_with() seems to incorrectly conclude that the two cannot be scheduled together because there is a dependency.

abadams · 2021-10-29T18:29:39Z

abadams
Oct 29, 2021
Maintainer

Here's how to break the dependency to make compute_with legal:

        intermediate(k) = input1(k, 0) * input1(k, 1);
        output1_tmp(k) = intermediate(k) * intermediate(k);
        output1(k) = output1_tmp(k);
        output2(k) = Halide::sqrt(output1_tmp(k) + output1_tmp(k));
        ....
        output1.compute_with(output2, k);
        output1_tmp.compute_at(output2, k);

You take your A -> B dependency, and turn it into A' -> A, A' -> B instead.
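For reference, the loop structure this restructuring is aiming for can be written by hand in plain C++. This is only a sketch of the intended result, not what Halide actually emits: the function name fused_reference and the use of std::vector are assumptions made for illustration, and vectorization is left out.

```cpp
#include <cmath>
#include <cstddef>
#include <utility>
#include <vector>

// Hand-written model of the fused pipeline (hypothetical helper, not Halide output).
// row0 and row1 stand in for input1(k, 0) and input1(k, 1).
// Returns {output1, output2}.
std::pair<std::vector<float>, std::vector<float>>
fused_reference(const std::vector<float> &row0, const std::vector<float> &row1) {
    const std::size_t n = row0.size() < row1.size() ? row0.size() : row1.size();
    std::vector<float> output1(n), output2(n);
    for (std::size_t k = 0; k < n; ++k) {
        const float intermediate = row0[k] * row1[k];
        const float output1_tmp = intermediate * intermediate;  // A'
        output1[k] = output1_tmp;                                // A' -> A
        output2[k] = std::sqrt(output1_tmp + output1_tmp);       // A' -> B
    }
    return {output1, output2};
}
```

Both outputs are written in the same iteration, and neither reads the other: each one only reads the shared temporary.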

mcourteaux · 2021-11-01T11:30:20Z

mcourteaux
Nov 1, 2021
Collaborator Author

Thank you, this solves my example here. I have a few questions:

Why does output1_tmp.compute_at(output2, k) work, but output1_tmp.compute_at(output1, k) doesn't? To me, output1_tmp is used for both Func definitions.
What is the underlying reason .compute_with() was invalid?

However, your hint allowed me to construct a minimal example where I hit a Halide bug (the one for which I originally opened the issue, but which I couldn't reduce to a minimal example until now):

class GeneratorBug : public Generator<GeneratorBug> {
   public:
    Input<Buffer<float>> input1{"input1", 2};
    Output<Buffer<float>> output1{"output1", 1};
    Output<Buffer<float>> output2{"output2", 1};
    Output<Buffer<float>> output3{"output3", 1};

    Var k{"kernel"};

    Func intermediate{"intermediate"};
    Func output1_value{"output1_value"};
    Func output3_value{"output3_value"};

    void generate() {
        intermediate(k) = input1(k, 0) * input1(k, 1);
        output1_value(k) = intermediate(k) * intermediate(k);
        output1(k) = output1_value(k);
        output2(k) = Halide::sqrt(output1_value(k) + output1_value(k));
        output3_value(k) = input1(k, 0) + 2.0f;
        output3(k) = output3_value(k);

        Expr num = input1.dim(0).extent();
        input1.dim(0).set_bounds(0, num);
        input1.dim(1).set_bounds(0, 2);
        output1.dim(0).set_bounds(0, num);
        output2.dim(0).set_bounds(0, num);
        output3.dim(0).set_bounds(0, num);
    }

    void schedule() {
        // clang-format off
        intermediate
            .vectorize(k, 8)
            .compute_at(output1_value, k)
            .bound_storage(k, 8)
            .store_in(MemoryType::Register)
            ;

        output1_value // mmCholR
            .vectorize(k, 8)
            .compute_at(output2, k)
            .bound_storage(k, 8)
            .store_in(MemoryType::Register)
            ;

        output1 // cholR
            .vectorize(k, 8)
            .compute_with(output2, k)
            ;

        output2 // sqrtDetR
            .vectorize(k, 8)
            ;

        output3_value // mmCholInflatedRxxBlock
            .vectorize(k, 8)
            .compute_at(output3, k)
            .bound_storage(k, 8)
            .store_in(MemoryType::Register)
            ;

        output3 // cholInflatedRxxBlock
            .vectorize(k, 8)
            .compute_with(output2, k)
            ;
        // clang-format on
    }
};

HALIDE_REGISTER_GENERATOR(GeneratorBug, generator_bug)

Some extra info:

I added a third output output3, that is unrelated to output1 and output2.
The computation of the third output uses a similar intermediate temp Func (output3_value). It is clearly not needed in this minimal example, but it is in my real use case.

This generator crashes with:

Unhandled exception: Internal Error at /home/martijn/w/3rd/Halide/src/BoundsInference.cpp:1182 triggered by user code at : Condition failed: f_args.size() == box.size():

I think this is a Halide bug, which is triggered by output3.compute_with(output2, k). I looked through the code, but I'm clueless. Running this in the debugger shows f_args = {"kernel"} while box.bounds = {}:

(gdb) p stage_name_to_func
$6 = std::map with 3 elements = {
    ["output1.s0"] = {contents = {strong = {ptr = 0x5555557d1350}, weak = 0x0, idx = 2}},
    ["output2.s0"] = {contents = {strong = {ptr = 0x5555557d1350}, weak = 0x0, idx = 4}},
    ["output3.s0"] = {contents = {strong = {ptr = 0x5555557d1350}, weak = 0x0, idx = 5}}
    }
(gdb) p b
$7 = {
    first = "output1.s0",
    second = {used = {<Halide::Internal::IRHandle> = {<Halide::Internal::IntrusivePtr<Halide::Internal::IRNode const>> = {ptr = 0x0}, <No data fields>}, <No data fields>}, bounds = std::vector of length 0, capacity 0}
    }
(gdb)

vksnk · 2021-11-01T18:45:33Z

vksnk
Nov 1, 2021
Collaborator

> I think this is a Halide bug, which is triggered by output3.compute_with(output2, k).

Yeah, this definitely looks like a bug, I will take a look.

mcourteaux · 2021-11-02T15:57:42Z

mcourteaux
Nov 2, 2021
Collaborator Author

Thank you for this fix! 😄 The Statement-file code looks good now. Also my main actual use case for my PhD research now successfully produces correct-looking code with all the loops merged like I wanted it.

I'd still like to hear opinions on my earlier questions. The scheduling tricks with extra intermediate functions seem like a hack that ideally would not be necessary.

vksnk · 2021-11-02T16:27:19Z

vksnk
Nov 2, 2021
Collaborator

The crash should be fixed now.

> Why does output1_tmp.compute_at(output2, k) work, but output1_tmp.compute_at(output1, k) doesn't? To me, output1_tmp is used for both Func definitions.

Good question. Right now the compiler looks through the IR to find valid sites for a function to be computed at. This is based on which loops exist in the IR, and since the loops for output1 and output2 were merged (by doing output1.compute_with(output2, k)), it only finds output2 as a legal site. I agree though that it should be possible to do output1_tmp.compute_at(output1, k) as well - I'll create a separate issue for that. (I am also wondering what happens when you have a transitive compute_with, like f1.compute_with(f2, k); f2.compute_with(f3, k); and at which of the three you should then do compute_at?)

> What is the underlying reason .compute_with() was invalid?

compute_with only uses relationships between functions to decide whether two functions can be computed together. For example, in the original program, the logic would be something like: output2 depends on output1, so output1 must be computed before output2, and thus output1 and output2 can't be computed together. It's true that sometimes this can be too restrictive (like in your example) and these functions can in fact be computed in the same loop, but there is no support for something like that right now (I might be wrong, but for an arbitrary program, this looks like a really tricky analysis).
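The conservative rule described here can be modelled in a few lines of plain C++. This is a sketch of the idea only, not Halide's implementation: CallGraph, depends_on, can_compute_with and the two pipeline builders are all hypothetical names introduced for illustration.

```cpp
#include <map>
#include <set>
#include <string>
#include <vector>

// Maps each function name to the functions it calls directly (hypothetical model).
using CallGraph = std::map<std::string, std::vector<std::string>>;

// True if `consumer` calls `producer` directly or through other functions.
bool depends_on(const CallGraph &g, const std::string &consumer,
                const std::string &producer) {
    std::set<std::string> visited;
    std::vector<std::string> stack{consumer};
    while (!stack.empty()) {
        const std::string f = stack.back();
        stack.pop_back();
        if (!visited.insert(f).second) continue;  // already explored
        const auto it = g.find(f);
        if (it == g.end()) continue;              // leaf, e.g. an input
        for (const std::string &callee : it->second) {
            if (callee == producer) return true;
            stack.push_back(callee);
        }
    }
    return false;
}

// Conservative rule: refuse to fuse two functions if either depends on the other.
bool can_compute_with(const CallGraph &g, const std::string &a,
                      const std::string &b) {
    return !depends_on(g, a, b) && !depends_on(g, b, a);
}

// Call graph of the original generator: output2 reads output1.
CallGraph original_pipeline() {
    return {{"intermediate", {"input1"}},
            {"output1", {"intermediate"}},
            {"output2", {"output1"}}};
}

// Call graph after adding output1_tmp: both outputs read only the temporary.
CallGraph restructured_pipeline() {
    return {{"intermediate", {"input1"}},
            {"output1_tmp", {"intermediate"}},
            {"output1", {"output1_tmp"}},
            {"output2", {"output1_tmp"}}};
}
```

Under this model the original pipeline is rejected because output2 reaches output1, while the restructured one is accepted because the two outputs are siblings that share a producer.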

vksnk · 2021-11-02T17:04:52Z

vksnk
Nov 2, 2021
Collaborator

> Thank you for this fix! 😄 The Statement-file code looks good now. Also my main actual use case for my PhD research now successfully produces correct-looking code with all the loops merged like I wanted it.

That's great to hear!

> The scheduling tricks with extra intermediate functions seem like a hack that ideally would not be necessary.

It's a hack for sure, but more advanced analysis of dependencies seems really tricky.

One idea though, maybe, we could add an extra flag to compute_with which would tell the compiler to ignore some of the restrictions. Basically, this would allow users to say 'I guarantee that it's safe to compute s1 and s2 in the same loop even though s2 depends on s1' (I guess this would be similar in spirit to allow_race_conditions()).

abadams · 2021-11-02T17:13:10Z

abadams
Nov 2, 2021
Maintainer

The issue is that we don't know how to do bounds inference in the dependent case. The computation bounds of the Func with two consumers would have to be enough to support the other compute_with Func within the shared loop nest, and also enough in aggregate to fill in the output region it's supposed to. We don't have any support for a Func having a region required at some position halfway through its loop nest, where a bunch of splits have been applied. I don't know how to write the code that would size those loops.

Having the user guarantee that it's safe would be saying: hey, ignore the bounds required of output1 by output2, and assume that just computing enough of output1 to fill the output buffer will work ok. It's just dropping a dependency edge in our bounds inference DAG and crossing our fingers. That could cause out-of-bounds reads or garbage output depending on the offset between the two merged loop nests. It's not something a user can reasonably guarantee.
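A toy model in plain C++ can make the hazard concrete. It is a sketch under stated assumptions, not a reproduction of Halide's behaviour: produce, consume_separately and consume_fused are hypothetical helpers, the consumer reads the producer at a fixed offset, and the producer buffer is padded so the sketch itself never reads out of bounds (0.0f marks a value that was never computed).

```cpp
#include <cstddef>
#include <vector>

// Producer value at index k (stands in for the Func that is consumed).
inline float produce(std::size_t k) { return static_cast<float>(k + 1); }

// Separate loops: the producer is computed over the whole region the
// consumer needs, [0, n + offset), before the consumer runs.
std::vector<float> consume_separately(std::size_t n, std::size_t offset) {
    std::vector<float> producer(n + offset, 0.0f), consumer(n);
    for (std::size_t k = 0; k < n + offset; ++k) producer[k] = produce(k);
    for (std::size_t k = 0; k < n; ++k) consumer[k] = producer[k + offset];
    return consumer;
}

// Fused loop with the dependency edge dropped: the producer only covers
// its own output region [0, n), one element per iteration.
std::vector<float> consume_fused(std::size_t n, std::size_t offset) {
    std::vector<float> producer(n + offset, 0.0f), consumer(n);
    for (std::size_t k = 0; k < n; ++k) {
        producer[k] = produce(k);
        consumer[k] = producer[k + offset];  // not yet computed when offset > 0
    }
    return consumer;
}
```

With offset 0 the two versions agree, which matches the pointwise example in this thread. With any positive offset the fused version reads values that have not been written yet.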

mcourteaux · 2021-11-30T19:40:23Z

mcourteaux
Nov 30, 2021
Collaborator Author

> Here's how to break the dependency to make compute_with legal:
>
>         intermediate(k) = input1(k, 0) * input1(k, 1);
>         output1_tmp(k) = intermediate(k) * intermediate(k);
>         output1(k) = output1_tmp(k);
>         output2(k) = Halide::sqrt(output1_tmp(k) + output1_tmp(k));
>         ....
>         output1.compute_with(output2, k);
>         output1_tmp.compute_at(output2, k);
>
> You take your A -> B dependency, and turn it into A' -> A, A' -> B instead.

Looking back at this (and thinking about the envisioned FAQs efforts), I'm wondering if the .in() directive could be used to simulate breaking this dependency like you did.

abadams · 2021-12-01T14:18:38Z

abadams
Dec 1, 2021
Maintainer

Using in() is a bit tricky for pipeline outputs, because it adds a wrapper Func for all consumers to call instead, but if it's the pipeline output...

But yes, if this pattern were in the middle of the pipeline, you could use in(). If you're using JIT, you could also just realize whatever.in() instead of realizing whatever.

mcourteaux · 2021-12-01T14:56:50Z

mcourteaux
Dec 1, 2021
Collaborator Author

I see, that makes sense. So no elegant solution for now, except for the tmp function.

cordovan66 · 2023-12-01T18:55:30Z

cordovan66
Dec 1, 2023

How do you generate the statement file you've shown in the top post above? I didn't see that described in the tutorials.
Thanks

2 replies

mcourteaux Dec 1, 2023
Collaborator Author

You can do it with the Generators (see tutorial part on Generators). You can use the emit flag -e with value html or conceptual_html. Conceptual is the stmt representation earlier in the lowering passes of the Halide compiler, which is more suited for working with parallel-fors, and GPU schedules.

cordovan66 Dec 1, 2023

Thanks!
