Unable to schedule two output buffers to be computed simultaneously when they are dependent. #6496
Replies: 11 comments 2 replies
-
Here's how to break the dependency to make compute_with legal:
You take your A -> B dependency and turn it into A' -> A and A' -> B instead: both functions read from a new shared producer A', and neither reads the other.
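To make the effect of that rewrite concrete, here is a plain C++ sketch (not Halide code) of the loop structure before and after. The names shared_value, two_loops and one_loop are made up for illustration, and the arithmetic is only a stand-in for whatever A' actually computes.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for the shared producer A'.
inline float shared_value(float in0, float in1) {
    float t = in0 * in1;
    return t * t;
}

struct Outputs {
    std::vector<float> a;  // plays the role of output A
    std::vector<float> b;  // plays the role of output B
};

// Original structure (A -> B): B reads A, so A's loop runs to completion first.
// Assumes in1 has at least as many elements as in0.
Outputs two_loops(const std::vector<float>& in0, const std::vector<float>& in1) {
    const std::size_t n = in0.size();
    Outputs out{std::vector<float>(n), std::vector<float>(n)};
    for (std::size_t k = 0; k < n; ++k) {
        out.a[k] = shared_value(in0[k], in1[k]);
    }
    for (std::size_t k = 0; k < n; ++k) {
        out.b[k] = std::sqrt(out.a[k] + out.a[k]);
    }
    return out;
}

// After the rewrite (A' -> A, A' -> B): both outputs read the same temporary,
// neither reads the other, so a single loop can fill both.
Outputs one_loop(const std::vector<float>& in0, const std::vector<float>& in1) {
    const std::size_t n = in0.size();
    Outputs out{std::vector<float>(n), std::vector<float>(n)};
    for (std::size_t k = 0; k < n; ++k) {
        const float a_prime = shared_value(in0[k], in1[k]);
        out.a[k] = a_prime;
        out.b[k] = std::sqrt(a_prime + a_prime);
    }
    return out;
}
```

Both functions produce the same values; the second form is roughly the loop nest that a compute_with schedule on the rewritten pipeline is aiming for.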
-
Thank you, this solves my example here. I have a few questions:
However, your hint allowed me to construct a minimal example where I hit a Halide bug (the one I opened the original issue for, but which I couldn't reduce to a minimal example until now):

    #include "Halide.h"

    using namespace Halide;

    class GeneratorBug : public Generator<GeneratorBug> {
    public:
        Input<Buffer<float>> input1{"input1", 2};
        Output<Buffer<float>> output1{"output1", 1};
        Output<Buffer<float>> output2{"output2", 1};
        Output<Buffer<float>> output3{"output3", 1};

        Var k{"kernel"};
        Func intermediate{"intermediate"};
        // Shared producers: the outputs read these instead of each other.
        Func output1_value{"output1_value"};
        Func output3_value{"output3_value"};

        void generate() {
            intermediate(k) = input1(k, 0) * input1(k, 1);
            output1_value(k) = intermediate(k) * intermediate(k);
            output1(k) = output1_value(k);
            output2(k) = Halide::sqrt(output1_value(k) + output1_value(k));
            output3_value(k) = input1(k, 0) + 2.0f;
            output3(k) = output3_value(k);

            // All buffers share the same extent along k.
            Expr num = input1.dim(0).extent();
            input1.dim(0).set_bounds(0, num);
            input1.dim(1).set_bounds(0, 2);
            output1.dim(0).set_bounds(0, num);
            output2.dim(0).set_bounds(0, num);
            output3.dim(0).set_bounds(0, num);
        }

        void schedule() {
            // clang-format off
            intermediate
                .vectorize(k, 8)
                .compute_at(output1_value, k)
                .bound_storage(k, 8)
                .store_in(MemoryType::Register)
                ;
            output1_value // mmCholR
                .vectorize(k, 8)
                .compute_at(output2, k)
                .bound_storage(k, 8)
                .store_in(MemoryType::Register)
                ;
            output1 // cholR
                .vectorize(k, 8)
                .compute_with(output2, k)
                ;
            output2 // sqrtDetR
                .vectorize(k, 8)
                ;
            output3_value // mmCholInflatedRxxBlock
                .vectorize(k, 8)
                .compute_at(output3, k)
                .bound_storage(k, 8)
                .store_in(MemoryType::Register)
                ;
            output3 // cholInflatedRxxBlock
                .vectorize(k, 8)
                .compute_with(output2, k)
                ;
            // clang-format on
        }
    };

    HALIDE_REGISTER_GENERATOR(GeneratorBug, generator_bug)

Some extra info:
This generator crashes with:
I think this is a Halide bug, which is triggered by
-
Yeah, this definitely looks like a bug, I will take a look.
-
Thank you for this fix! 😄 The statement-file code looks good now. Also, my main use case for my PhD research now produces correct-looking code with all the loops merged the way I wanted. I'd still like to hear opinions on my earlier questions. The scheduling tricks with extra intermediate functions seem like a hack that ideally would not be necessary.
-
The crash should be fixed now.
Good question, right now it looks through IR to find valid sites for a function to be computed at. This is based on what loops there are in IR and since loops for output1 and output2 were merged (by doing
compute_with only uses relationships between functions to decide whether two functions can be computed together. For example, in the original program, the logic would be something like: output2 depends on output1, so output1 must be computed before output2, and therefore output1 and output2 can't be computed together. It's true that this can sometimes be too restrictive (like in your example), and such functions can in fact be computed in the same loop, but there is no support for that right now (I might be wrong, but for an arbitrary program this looks like a really tricky analysis).
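As a rough model of that rule, here is a small C++ sketch. It is not Halide's actual implementation, and CallGraph, depends_on and can_compute_with are hypothetical names; it only shows the "reject if either function reaches the other" check described above.

```cpp
#include <map>
#include <set>
#include <string>
#include <vector>

// Maps each function name to the functions it calls directly.
using CallGraph = std::map<std::string, std::vector<std::string>>;

// True if `consumer` reaches `producer` through one or more calls.
bool depends_on(const CallGraph& g, const std::string& consumer,
                const std::string& producer) {
    std::set<std::string> seen;
    std::vector<std::string> stack{consumer};
    while (!stack.empty()) {
        const std::string f = stack.back();
        stack.pop_back();
        auto it = g.find(f);
        if (it == g.end()) continue;  // leaf: calls nothing we know about
        for (const auto& callee : it->second) {
            if (callee == producer) return true;
            if (seen.insert(callee).second) stack.push_back(callee);
        }
    }
    return false;
}

// Conservative rule: fusing is rejected if either function depends on the
// other, regardless of whether the access pattern would make it safe.
bool can_compute_with(const CallGraph& g, const std::string& a,
                      const std::string& b) {
    return !depends_on(g, a, b) && !depends_on(g, b, a);
}
```

With a graph where output2 calls output1, can_compute_with returns false. Once both outputs call a shared output1_value instead, neither reaches the other and the check passes.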
-
That's great to hear!
It's a hack for sure, but more advanced analysis of dependencies seems really tricky. One idea, though: maybe we could add an extra flag to compute_with that tells the compiler to ignore some of the restrictions. Basically, this would allow users to say 'I guarantee that it's safe to compute s1 and s2 in the same loop even though s2 depends on s1' (I guess this would be similar in spirit to
-
The issue is that we don't know how to do bounds inference in the dependent case. The computation bounds of the Func with two consumers would have to be enough to support the other compute_with Func within the shared loop nest, and also enough in aggregate to fill in the output region it's supposed to. We don't have any support for a Func having a region required at some position halfway through its loop nest, where a bunch of splits have been applied. I don't know how to write the code that would size those loops. Having the user guarantee that it's safe would be saying: hey, ignore the bounds required of output1 by output2, and assume that just computing enough of output1 to fill the output buffer will work ok. It's just dropping a dependency edge in our bounds inference DAG and crossing our fingers. That could cause out-of-bounds reads or garbage output depending on the offset between the two merged loop nests. It's not something a user can reasonably guarantee.
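A toy C++ illustration of the offset problem (plain loops, not Halide; the function names and the -1 sentinel are made up for this sketch): if the consumer at x reads the producer at x + offset, a fused loop that only produces element x per iteration reads values that have not been computed yet.

```cpp
#include <cstddef>
#include <vector>

// Reference: run the producer loop to completion, then the consumer.
// The consumer at x reads the producer at x + offset.
std::vector<int> separate_loops(std::size_t n, std::size_t offset) {
    std::vector<int> producer(n + offset, -1);  // -1 marks "not computed yet"
    for (std::size_t x = 0; x < n + offset; ++x) {
        producer[x] = static_cast<int>(x * x);
    }
    std::vector<int> consumer(n);
    for (std::size_t x = 0; x < n; ++x) {
        consumer[x] = producer[x + offset];
    }
    return consumer;
}

// Naive fusion that ignores what the consumer requires of the producer:
// iteration x only produces producer[x], sized for the output region alone.
std::vector<int> fused_ignoring_dependency(std::size_t n, std::size_t offset) {
    // Padded so this sketch stays in bounds; a buffer sized only for the
    // output region would be read out of bounds here instead.
    std::vector<int> producer(n + offset, -1);
    std::vector<int> consumer(n);
    for (std::size_t x = 0; x < n; ++x) {
        producer[x] = static_cast<int>(x * x);
        consumer[x] = producer[x + offset];  // not yet computed when offset > 0
    }
    return consumer;
}
```

With offset 0 the two versions agree, which is the case in the generator above. With any positive offset the fused version returns the sentinel instead of real data, and nothing in the schedule alone tells you which case you are in.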
-
Looking back at this (and thinking about the envisioned FAQs efforts), I'm wondering if the
-
Using in() is a bit tricky for pipeline outputs, because it adds a wrapper Func for all consumers to call instead, but if it's the pipeline output... But yes, if this pattern were in the middle of the pipeline, you could use in(). If you're using JIT, you could also just realize whatever.in() instead of realizing whatever.
-
I see, that makes sense. So no elegant solution for now, except for the tmp function.
-
How do you generate the statement file you've shown in the top post above? I didn't see that described in the tutorials.
-
I'm trying to boil my problem down to a minimal example. Here is my generator:
Clearly, output2 depends on output1. I want everything to be processed in one for loop, with vectors of size 8. However, the statement file looks like this:
Clearly the two outputs are produced separately. Uncommenting either of the compute_with directives causes an error when running the generator:
So far, I haven't been able to come up with a workaround. I think output1.compute_at(output2, k) doesn't work because output1 is an output buffer, which is implicitly scheduled like .compute_root(), if I understand correctly. And .compute_with() concludes, kinda incorrectly, that it's impossible to schedule these together because there is a dependency.